Abstract

This article gives theoretical insights into the performance of K-SVD, a dictionary learning algorithm that
has gained significant popularity in practical applications. The particular question studied here is when a
dictionary $\Phi \in \mathbb{R}^{d \times K}$ can be recovered as a local minimum of the minimisation criterion underlying K-SVD
from a set of $N$ training signals $y_n = \Phi x_n$. A theoretical analysis of the problem leads to two types of
identifiability results assuming the training signals are generated from a tight frame with coefficients drawn
from a random symmetric distribution. First, asymptotic results show that in expectation the generating
dictionary can be recovered exactly as a local minimum of the K-SVD criterion if the coefficient distribution
exhibits sufficient decay. This decay can be characterised by the coherence of the dictionary and the $\ell_1$-norm
of the coefficients. Second, based on the asymptotic results it is demonstrated that given a finite number
of training samples $N$, such that $N/\log N = O(K^3 d)$, except with probability $O(N^{-Kd})$ there is a local
minimum of the K-SVD criterion within distance $O(K N^{-1/4})$ to the generating dictionary.

Index Terms 

dictionary learning, sparse coding, K-SVD, finite sample size, sampling complexity, dictionary identification, 
minimisation criterion, sparse representation 

1 Introduction 

As the universe expands so does the information we are collecting about and in it. New and 
diverse sources such as the internet, astronomic observations, medical diagnostics etc. confront 
us with a flood of data in ever increasing dimensions and while we have a lot of technology at
our disposal to acquire these data, we are already facing difficulties in storing and even more 
importantly interpreting them. Thus in the last decades high-dimensional data processing has 
become a very challenging and interdisciplinary field, requiring the collaboration of researchers 
capturing the data on one hand and researchers from computer science, information theory, 
electric engineering and applied mathematics, developing the tools to deal with the data on the 
other hand. One of the most promising approaches to dealing with high-dimensional data so 
far has proven to be through the concept of sparsity. 

A signal is called sparse if it has a representation or good approximation in a dictionary, ie. a
representation system like an orthonormal basis or frame, $\Phi$, such that the number of dictionary
elements, also called atoms, with non-zero coefficients is small compared to the dimension of
the space. Modelling the signals as vectors $y \in \mathbb{R}^d$ and the dictionary accordingly as a matrix
collecting $K$ normalised atom-vectors as its columns, ie. $\Phi = (\phi_1, \ldots, \phi_K)$, $\phi_i \in \mathbb{R}^d$, $\|\phi_i\|_2 = 1$, we
have

$$ y \approx \Phi_I x_I = \sum_{i \in I} x(i) \phi_i $$

for a set $I$ of size $S$, ie. $|I| = S$, which is small compared to the ambient dimension, ie.
$S \ll d \leq K$.

The above characterisation already shows why sparsity provides such an elegant way of dealing 
with high-dimensional data. No matter the size of the original signal, given the right dictionary, 
its size effectively reduces to a small number of non-zero coefficients. For instance the sparsity 
of natural images in wavelet bases is the fundamental principle underlying the compression 
standard JPEG 2000. 

Classical sparsity research studies two types of problems. The first line of research investigates 
how to perform the dimensionality reduction algorithmically, ie. how to find the sparse approx- 
imations of a signal given the sparsity inducing dictionary. By now there exists a substantial 
amount of theory including a vast choice of algorithms, e.g. |[T0|, |6|, |2"3|, O, 10, together with 
analysis about their worst case or average case performance, [30J, |3lf , |[2"8|, |[T6|. The second 
line of research investigates how sparsity can be exploited for efficient data processing. So it 
has been shown that sparse signals are very robust to noise or corruption and can therefore 
easily be denoised, [12 j, or restored from incomplete information. This second effect is being 
exploited in the very active research field of compressed sensing, see |TT1, 0, II25II . 
However, while sparsity based methods have proven very efficient for high-dimensional data
processing, they suffer from one common drawback. They all rely on the existence of a dictionary 
providing sparse representations for the data at hand. 

The traditional approach to finding efficient dictionaries is through the careful analysis of the 
given data class, which for instance has led to the development of wavelets, [8], and curvelets, 
|4fl, for natural images. However when faced with a (possibly exotic) new signal class this 
analytic approach has the disadvantage of requiring too much time and effort. Therefore, more 
recently, researchers have started to investigate the possibilities of learning the appropriate 
dictionary directly from the new data class, ie. given $N$ signals $y_n \in \mathbb{R}^d$, stored as columns in
a matrix $Y = (y_1, \ldots, y_N)$ find a decomposition

$$ Y \approx \Phi X $$

into a $d \times K$ dictionary matrix $\Phi$ with unit norm columns and a $K \times N$ coefficient matrix $X$ with
sparse columns.

So far the research focus in dictionary learning has been on algorithmic development, meaning 
that by now there are several dictionary learning algorithms, which are efficient in practice 
and therefore popular in applications, see El, JHl, Q, J23, (3U, HH, |29| or £5| for a more 
complete survey. On the other hand there is only a handful of dictionary learning schemes, for 
which theoretical results are available, 0, JTSJ, (T7|, |14|, ffl8"1 . While for these schemes there are 
known conditions under which a dictionary can be recovered from a given signal class, their 
practical applicability is severely limited by their computational complexity. In [2] the authors
themselves state that the algorithm is only of theoretical interest and also the $\ell_1$-minimisation
principle, suggested in [35], [24] and studied in [17], [14], [18], is not suitable for very
high-dimensional data.

In this paper we will start bridging the gap between practically efficient and provably efficient 
dictionary learning schemes, by providing identification results for the minimisation principle 
underlying K-SVD (K-Singular Value Decomposition), one of the most widely applied dictionary 
algorithms. 

K-SVD was introduced by Aharon, Elad and Bruckstein in [1] as a generalisation of the K-means 
clustering process. The starting point for the algorithm is the following minimisation criterion. 
Given some signals $Y = (y_1, \ldots, y_N)$, $y_n \in \mathbb{R}^d$, find

$$ \min_{\Phi \in \mathcal{D},\, X \in \mathcal{X}_S} \|Y - \Phi X\|_F^2 \tag{1} $$

for $\mathcal{D} := \{\Phi = (\phi_1, \ldots, \phi_K),\ \phi_i \in \mathbb{R}^d,\ \|\phi_i\|_2 = 1\}$ and $\mathcal{X}_S := \{X = (x_1, \ldots, x_N),\ x_n \in \mathbb{R}^K,\ \|x_n\|_0 \leq S\}$,
where $\|x\|_0$ counts the number of non-zero entries of $x$, and $\|\cdot\|_F$ denotes the Frobenius
norm. In other words we are looking for the dictionary that provides on average the best $S$-term
approximation to the signals in $Y$.

K-SVD aims to find the minimum of (1) by alternating two procedures, a) fixing the dictionary
$\Phi$ and finding a new close to optimal coefficient matrix $X^{new}$ column-wise, using a sparse
approximation algorithm such as (Orthogonal) Matching Pursuit or Basis Pursuit, and b) updating
the dictionary atom-wise, choosing the updated atom $\phi_i^{new}$ to be the left singular vector to the
maximal singular value of the matrix having as its columns the residuals $y_n - \sum_{k \neq i} \phi_k x_n(k)$
of all signals $y_n$ to which the current atom $\phi_i$ contributes, ie. $X_{in} = x_n(i) \neq 0$. We will not
go further into algorithmic details, but refer the reader to the original paper [1] as well as [2].
Instead we concentrate on the theoretical aspects of the posed minimisation problem.
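First it will be convenient to rewrite the objective function using the fact that for any signal $y_n$
the best $S$-term approximation using $\Phi$ is given by the largest projection onto a set of $S$ atoms
$\Phi_I = (\phi_{i_1}, \ldots, \phi_{i_S})$, ie.,

$$\begin{aligned}
\min_{\Phi \in \mathcal{D},\, X \in \mathcal{X}_S} \|Y - \Phi X\|_F^2
&= \min_{\Phi \in \mathcal{D}} \sum_n \min_{\|x_n\|_0 \leq S} \|y_n - \Phi x_n\|_2^2 \\
&= \min_{\Phi \in \mathcal{D}} \sum_n \min_{|I| \leq S} \|y_n - \Phi_I \Phi_I^\dagger y_n\|_2^2 \\
&= \|Y\|_F^2 - \max_{\Phi \in \mathcal{D}} \sum_n \max_{|I| \leq S} \|\Phi_I \Phi_I^\dagger y_n\|_2^2,
\end{aligned}$$

where $\Phi_I^\dagger$ denotes the Moore-Penrose pseudo inverse of $\Phi_I$. Abbreviating the projection onto
the span of $(\phi_i)_{i \in I}$ by $P_I(\Phi) = \Phi_I \Phi_I^\dagger$ we can thus replace the minimisation problem in (1) with
the following maximisation problem,

$$ \max_{\Phi \in \mathcal{D}} \sum_n \max_{|I| \leq S} \|P_I(\Phi) y_n\|_2^2. \tag{2} $$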
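For $S = 1$ and unit-norm atoms, the equivalence between the residual form and the projection form of the criterion can be checked numerically on a toy example. The following is a minimal sketch only: the three atoms, the signal and the helper names are arbitrary illustrative choices, not taken from the paper.

```python
import math

def best_one_term_residual(atoms, y):
    """Smallest squared residual min_{i, x} ||y - x * atoms[i]||^2 for unit-norm atoms."""
    best = math.inf
    for phi in atoms:
        coeff = sum(p * v for p, v in zip(phi, y))  # optimal coefficient <phi, y>
        residual = sum((v - coeff * p) ** 2 for p, v in zip(phi, y))
        best = min(best, residual)
    return best

def max_projection_energy(atoms, y):
    """max_i |<atoms[i], y>|^2, the S = 1 projection term of the maximisation form."""
    return max(sum(p * v for p, v in zip(phi, y)) ** 2 for phi in atoms)

# Illustrative dictionary with K = 3 unit-norm atoms in R^2 and one signal.
atoms = [(1.0, 0.0), (0.0, 1.0), (math.sqrt(0.5), math.sqrt(0.5))]
y = (2.0, 1.0)

energy = sum(v * v for v in y)
lhs = best_one_term_residual(atoms, y)             # residual form
rhs = energy - max_projection_energy(atoms, y)     # ||y||^2 minus projection form
```

Both quantities agree up to rounding, which is the single-signal, single-atom instance of the identity used to pass from the minimisation to the maximisation problem.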

From the above formulation it is quite easy to see the motivation for the proposed learning
criterion. Indeed assume that the training signals are all $S$-sparse in an admissible dictionary
$\Phi \in \mathcal{D}$, ie. $Y = \Phi X$ and $\|x_n\|_0 \leq S$, then clearly there is a global maximum¹ of (2) at $\Phi$,
respectively a global minimum of (1) at $(\Phi, X)$, as long as the sparsity level used in the criterion is at least $S$.
However in practice we will be facing something like,

$$ y_n = \Phi x_n + r_n \quad \Leftrightarrow \quad Y = \Phi X + R, \tag{3} $$

where the coefficient vectors $x_n$ in $X$ are only approximately $S$-sparse or rapidly decaying and
the pure signals are corrupted with noise $R = (r_1, \ldots, r_N)$. In this case it is no longer trivial or
obvious that $\Phi$ is a local maximum of (2), but we can hope for a result of the following type.

Theorem 1.1 (Goal): Assume that the signals $y_n$ are generated as in (3), with $x_n$ drawn from a
distribution of approximately sparse or decaying vectors and $r_n$ random noise. As soon as the
number of signals $N$ is large enough, $N \geq C$, with high probability $p \approx 1$ there will be a local
maximum of (2) within distance $\varepsilon$ from $\Phi$.

The rest of this paper is organised as follows. We first give conditions on the dictionary and
the coefficients which allow for asymptotic identifiability by studying when $\Phi$ is exactly at a
local maximum in the limiting case, ie. replacing the sum in (2) with the expectation,

$$ \max_{\Phi \in \mathcal{D}} \mathbb{E}_y \left( \max_{|I| \leq S} \|P_I(\Phi) y\|_2^2 \right). \tag{4} $$

Thus in Section 2 we will prove identification results for the case when in (4) we have $S = 1$,
ie. $\mathcal{X}_S = \mathcal{X}_1$, assuming first a simple (discrete, noise-free) signal model and then progressing to
a noisy, continuous signal model. In Section 3 we will extend these results to the case $S > 1$.
Finally in Sections 4 and 5 we will go from asymptotic results to results for finite sample sizes
and prove versions of Theorem 1.1 that quantify the sizes of the parameters $\varepsilon, p$ in terms of the
number of training signals $N$ and the size of $C$ in terms of the number of atoms $K$. In the last
section we will discuss the implications of our results for practical applications, compare them
to existing identification results and point out some directions for future research.

2 Asymptotic identification results for S = 1
2.1 Notation 

Before we jump into the fray, a few words on notations; usually subscripted letters will denote
vectors with the exception of $c$ and $\varepsilon$ where they are numbers, eg. $(x_1, \ldots, x_K) = X \in \mathbb{R}^{d \times K}$ vs.
$c = (c_1, \ldots, c_K) \in \mathbb{R}^K$, however, it should always be clear from the context what we are dealing
with. For a matrix $M$, we denote its (conjugate) transpose by $M^*$ and its operator norm by
$\|M\|_{2,2} = \max_{\|x\|_2 = 1} \|Mx\|_2$.

1. $\Phi$ is a global maximiser together with all $2^K K!$ dictionaries consisting of a permutation of the atoms in $\Phi$
provided with a $\pm 1$ sign. For a more detailed discussion on the uniqueness of the maximiser/minimiser see eg. [17].

We consider a frame $\Phi$ a collection of $K \geq d$ vectors $\phi_i \in \mathbb{R}^d$ for which there exist two positive
constants $A, B$ such that for all $v \in \mathbb{R}^d$ we have

$$ A \|v\|_2^2 \leq \sum_{i=1}^K |\langle v, \phi_i \rangle|^2 \leq B \|v\|_2^2. \tag{5} $$

If $B$ can be chosen equal to $A$, ie. $B = A$, the frame is called tight and if all elements of a tight
frame have unit norm we have $A = K/d$.
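As a small illustration of the relation $A = K/d$, the sketch below builds the frame operator $\Phi\Phi^*$ for three unit vectors at 120 degree spacing in the plane. This frame is an illustrative choice of ours, not an example from the paper, and `frame_operator` is a hypothetical helper name.

```python
import math

def frame_operator(atoms):
    """Return the d x d matrix Phi Phi^* = sum_i phi_i phi_i^T as nested lists."""
    d = len(atoms[0])
    return [[sum(phi[r] * phi[c] for phi in atoms) for c in range(d)]
            for r in range(d)]

# Three unit vectors at 120 degree spacing in R^2, so K = 3 and d = 2.
atoms = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
         for k in range(3)]

S_op = frame_operator(atoms)
A = len(atoms) / len(atoms[0])  # expected frame constant K / d
```

For a unit norm tight frame the frame operator is $A$ times the identity, so `S_op` should be close to `[[1.5, 0.0], [0.0, 1.5]]` here.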

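Finally we introduce the Landau symbols $O, o$ to characterise the growth of a function. We write
$f(\varepsilon) = O(g(\varepsilon))$ if $\lim_{\varepsilon \to 0} f(\varepsilon)/g(\varepsilon) = C < \infty$ and $f(\varepsilon) = o(g(\varepsilon))$ if $\lim_{\varepsilon \to 0} f(\varepsilon)/g(\varepsilon) = 0$.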

2.2 The problem for S = 1 

In case $S = 1$ the expression for which we have to maximise the expectation in (4) can be
radically simplified, ie.

$$ \max_{|I| \leq 1} \|P_I(\Phi) y\|_2^2 = \max_i |\langle \phi_i, y \rangle|^2 = \|\Phi^* y\|_\infty^2, $$

and the maximisation problem we want to analyse reduces to,

$$ \max_{\Phi \in \mathcal{D}} \mathbb{E}_y \|\Phi^* y\|_\infty^2. \tag{6} $$

As mentioned in the introduction if the signals $y$ are all 1-sparse in a dictionary $\Phi$ then clearly
$\Phi$ is a global maximiser of (6). However what happens if we do not have perfect sparsity? Let
us start with a very simple negative example of a coefficient distribution for which the original
generating dictionary is not at a local maximum.

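Example 2.1: Let $U$ be an orthonormal basis and $x$ be randomly 2-sparse with 'flat' coefficients,
ie. pick two indices $i, j$, choose $\sigma_i, \sigma_j = \pm 1$ uniformly at random and set $x_k = \sigma_k$ for $k = i, j$
and zero else. Then $U$ is not a local maximum of (6). Indeed since the signals are all 2-sparse
the maximal inner product with all atoms in $U$ is the same as the maximal inner product with
only $d - 1$ atoms. This degree of freedom we can use to construct an ascent direction. Choose
$U_\varepsilon = (u_1, \ldots, u_{d-1}, (u_d + \varepsilon u_1)/\sqrt{1 + \varepsilon^2})$ with $\varepsilon > 0$, then, for the pair $\{i, j\}$ chosen uniformly at random, we have

$$ \mathbb{E}_x \|U_\varepsilon^* U x\|_\infty^2 = \mathbb{E}_x \max\left\{1, \frac{(x_d + \varepsilon x_1)^2}{1 + \varepsilon^2}\right\} = 1 + \frac{2\varepsilon}{d(d-1)(1 + \varepsilon^2)} > 1 = \mathbb{E}_x \|U^* U x\|_\infty^2. $$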
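The ascent direction in this example can be verified by exact enumeration. The sketch below takes $U$ to be the canonical basis of $\mathbb{R}^4$ and $\varepsilon = 0.1$ (both illustrative choices), assumes the support pair is drawn uniformly, and compares the enumerated expectation with the value $1 + 2\varepsilon / (d(d-1)(1+\varepsilon^2))$ that follows from a direct calculation under that assumption.

```python
import itertools
import math

def expected_max_energy(atoms, d):
    """Exact E_x max_i |<atoms[i], x>|^2 over all 2-sparse x with +-1 entries."""
    total, count = 0.0, 0
    for i, j in itertools.combinations(range(d), 2):
        for si, sj in itertools.product((1.0, -1.0), repeat=2):
            x = [0.0] * d
            x[i], x[j] = si, sj
            total += max(sum(a * v for a, v in zip(atom, x)) ** 2 for atom in atoms)
            count += 1
    return total / count

d, eps = 4, 0.1
basis = [[1.0 if r == k else 0.0 for r in range(d)] for k in range(d)]
scale = math.sqrt(1 + eps ** 2)
# Replace the last atom u_d by (u_d + eps * u_1) / sqrt(1 + eps^2).
perturbed = basis[:-1] + [[(basis[-1][r] + eps * basis[0][r]) / scale for r in range(d)]]

value_original = expected_max_energy(basis, d)
value_perturbed = expected_max_energy(perturbed, d)
predicted = 1 + 2 * eps / (d * (d - 1) * (1 + eps ** 2))
```

The perturbed basis yields a strictly larger value than the original one, so the original basis cannot be a local maximum for flat coefficients.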
From the above example we see that in order to have a local maximum at the original dictionary
we need a signal/coefficient model where the coefficients show some type of decay.

2.3 A simple model of decaying coefficients 

To get started we consider a very simple coefficient model, constructed from a non-negative,
non-increasing sequence $c \in \mathbb{R}^K$ with $\|c\|_2 = 1$, which we permute uniformly at random and
provide with random $\pm$ signs. To be precise for a permutation $p : \{1, \ldots, K\} \to \{1, \ldots, K\}$ and a
sign sequence $\sigma$, $\sigma_i = \pm 1$, we define the sequence $c_{p,\sigma}$ component-wise as $c_{p,\sigma}(i) := \sigma_i c_{p(i)}$, and
set $y = \Phi x$ where $x = c_{p,\sigma}$ with probability $(2^K K!)^{-1}$.

The normalisation $\|c\|_2 = 1$ has the advantage that for dictionaries, which are an orthonormal
basis, the resulting signals also have unit norm and for general dictionaries the signals have unit
square norm in expectation, ie. $\mathbb{E}(\|y\|_2^2) = 1$. This reflects the situation in practical applications,
where we would normalise the signals in order to equally weight their importance.
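The signal model and the normalisation property $\mathbb{E}(\|y\|_2^2) = 1$ can be illustrated by enumerating all $2^K K!$ signed permutations for a small example. This is a sketch under our own assumptions: the frame (three unit vectors in the plane) and the sequence $c = (0.8, 0.6, 0)$ are illustrative, and the function names are hypothetical.

```python
import itertools
import math

def signed_permutations(c):
    """Yield every x with x[i] = sigma_i * c[p(i)] over all permutations p and signs sigma."""
    K = len(c)
    for perm in itertools.permutations(range(K)):
        for signs in itertools.product((1.0, -1.0), repeat=K):
            yield [s * c[perm[i]] for i, s in enumerate(signs)]

def synthesize(atoms, x):
    """Return y = Phi x, where atoms[k] is the k-th column of Phi."""
    d = len(atoms[0])
    return [sum(x[k] * atoms[k][r] for k in range(len(atoms))) for r in range(d)]

# Illustrative unit norm frame with K = 3 atoms in R^2.
atoms = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
         for k in range(3)]
c = [0.8, 0.6, 0.0]  # non-negative, non-increasing, ||c||_2 = 1

energies = [sum(v * v for v in synthesize(atoms, x)) for x in signed_permutations(c)]
mean_energy = sum(energies) / len(energies)
```

Individual signals generally do not have unit norm for an overcomplete dictionary, but the random signs cancel all cross terms on average, so `mean_energy` equals 1 up to rounding.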
Armed with this model we can now prove a first dictionary identification result for (6).

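Theorem 2.1: Let $\Phi$ be a unit norm tight frame with frame constant $A = K/d$ and coherence $\mu$.
Let $x \in \mathbb{R}^K$ be a random permutation of a sequence $c$, where $c_1 \geq c_2 \geq c_3 \ldots \geq c_K \geq 0$ and
$\|c\|_2 = 1$, provided with random $\pm$ signs, i.e. $x = c_{p,\sigma}$ with probability $P(p, \sigma) = (2^K K!)^{-1}$. If
$c$ satisfies $c_1 > c_2 + 2\mu\|c\|_1$ then there is a local maximum of (6) at $\Phi$. Moreover we have the
following quantitative estimate for the basin of attraction around $\Phi$. For all perturbations $\Psi =
(\psi_1 \ldots \psi_K)$ of $\Phi = (\phi_1 \ldots \phi_K)$ with $0 < \max_i \|\psi_i - \phi_i\|_2 \leq \varepsilon$ we have $\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 < \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2$,
as soon as $\varepsilon < 1/5$ and

$$ \varepsilon \leq \frac{\left(1 - 2\frac{c_2 + \mu\|c\|_1}{c_2 + c_1}\right)^2}{2A \log\left(2AK / \left(c_1^2 - \frac{1 - c_1^2}{K-1}\right)\right)}. \tag{7} $$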
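The decay condition $c_1 > c_2 + 2\mu\|c\|_1$ is easy to test for a concrete dictionary and coefficient sequence. The following sketch computes the coherence and evaluates the condition; the two dictionaries and the sequence are illustrative choices, and both function names are hypothetical.

```python
import math

def coherence(atoms):
    """Largest absolute inner product between two distinct unit-norm atoms."""
    K = len(atoms)
    return max(abs(sum(a * b for a, b in zip(atoms[i], atoms[j])))
               for i in range(K) for j in range(i + 1, K))

def satisfies_decay_condition(c, mu):
    """Check c_1 > c_2 + 2 * mu * ||c||_1 for a non-increasing, non-negative sequence c."""
    return c[0] > c[1] + 2 * mu * sum(c)

onb = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
three_in_plane = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
                  for k in range(3)]
c = [0.8, 0.6, 0.0]

ok_onb = satisfies_decay_condition(c, coherence(onb))
ok_plane = satisfies_decay_condition(c, coherence(three_in_plane))
```

For the orthonormal basis the coherence is zero and the condition reduces to $c_1 > c_2$, which holds here. For the three vectors in the plane the coherence is $1/2$, so the condition would require $c_1 > c_2 + \|c\|_1$ and can never hold; the theorem is informative only for sufficiently incoherent dictionaries.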

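Proof: We start by calculating the expectation of the maximally recoverable energy using
the original dictionary $\Phi$.

$$\begin{aligned}
\mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 &= \mathbb{E}_p \mathbb{E}_\sigma \|\Phi^*\Phi c_{p,\sigma}\|_\infty^2 \\
&= \mathbb{E}_p \mathbb{E}_\sigma \left( \max_{i=1\ldots K} |\langle \phi_i, \Phi c_{p,\sigma} \rangle|^2 \right) \\
&= \mathbb{E}_p \mathbb{E}_\sigma \left( \max_{i=1\ldots K} \Big| \sum_{j} \sigma_j c_{p(j)} \langle \phi_i, \phi_j \rangle \Big|^2 \right).
\end{aligned}$$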
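To estimate the maximal inner product we first assume that $p$ is fixed. Setting $i_p = p^{-1}(1)$ we
get

$$ |\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle| = \Big| \sigma_{i_p} c_1 + \sum_{j \neq i_p} \sigma_j c_{p(j)} \langle \phi_{i_p}, \phi_j \rangle \Big| \geq c_1 - \mu\|c\|_1, \tag{8} $$

while for all $i \neq i_p$ we have

$$ |\langle \phi_i, \Phi c_{p,\sigma} \rangle| = \Big| \sigma_i c_{p(i)} + \sum_{j \neq i} \sigma_j c_{p(j)} \langle \phi_i, \phi_j \rangle \Big| \leq c_2 + \mu\|c\|_1. \tag{9} $$

Together with the condition that $c_1 > c_2 + 2\mu\|c\|_1$ the above estimates ensure that the maximal
inner product is attained by $i_p$, ie.

$$ \|\Phi^*\Phi c_{p,\sigma}\|_\infty = \max_{i=1\ldots K} |\langle \phi_i, \Phi c_{p,\sigma} \rangle| = |\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle|. $$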

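Using the concrete expression for the maximal inner product we quickly² arrive at,

$$\begin{aligned}
\mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 &= \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) \\
&= \mathbb{E}_p \Big( c_1^2 + \sum_{j \neq i_p} c_{p(j)}^2 |\langle \phi_{i_p}, \phi_j \rangle|^2 \Big) \\
&= c_1^2 + \frac{1 - c_1^2}{K - 1} (A - 1).
\end{aligned}$$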
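The closed form $c_1^2 + \frac{1-c_1^2}{K-1}(A-1)$ can be checked by brute force for a small tight frame. The sketch below is built on assumptions of ours: it uses a simplex-type frame of $K = 6$ unit vectors spanning a 5-dimensional space (pairwise inner products $-1/5$, hence $\mu = 1/5$ and $A = 6/5$), works directly with its Gram matrix, and picks a strongly decaying $c$ so that the decay condition holds.

```python
import itertools
import math

K = 6
# Gram matrix Phi^* Phi of a simplex-type unit norm tight frame (illustrative choice):
# ones on the diagonal, -1/(K-1) elsewhere, so mu = 1/5 and A = K/d = 6/5.
gram = [[1.0 if i == j else -1.0 / (K - 1) for j in range(K)] for i in range(K)]
A = K / (K - 1)
mu = 1.0 / (K - 1)

c = [math.sqrt(0.99), 0.1, 0.0, 0.0, 0.0, 0.0]  # ||c||_2 = 1 with strong decay
condition_holds = c[0] > c[1] + 2 * mu * sum(c)

total, count = 0.0, 0
for perm in itertools.permutations(range(K)):
    base = [c[perm[i]] for i in range(K)]
    for signs in itertools.product((1.0, -1.0), repeat=K):
        x = [s * v for s, v in zip(signs, base)]
        # ||Phi^* Phi x||_inf^2, evaluated through the Gram matrix.
        total += max(sum(g * v for g, v in zip(row, x)) ** 2 for row in gram)
        count += 1

enumerated = total / count
closed_form = c[0] ** 2 + (1 - c[0] ** 2) / (K - 1) * (A - 1)
```

The enumeration runs over all $2^6 \cdot 6! = 46080$ signed permutations, which takes a moment in pure Python, and should reproduce the closed form (here $0.9904$) up to rounding.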

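To compute the expectation for a perturbation of the original dictionary we first note that we
can parametrise all $\varepsilon$-perturbations $\Psi$ of the original dictionary $\Phi$ with $\|\psi_i - \phi_i\|_2 = \varepsilon_i \leq \varepsilon$ as

$$ \psi_i = \alpha_i \phi_i + \omega_i z_i $$

for some $z_i$ with $\langle \phi_i, z_i \rangle = 0$, $\|z_i\|_2 = 1$ and $\alpha_i := 1 - \varepsilon_i^2/2$ and $\omega_i := (\varepsilon_i^2 - \varepsilon_i^4/4)^{1/2}$. Expanding the
expectation as before we get,

$$\begin{aligned}
\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 &= \mathbb{E}_p \mathbb{E}_\sigma \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2 \\
&= \mathbb{E}_p \mathbb{E}_\sigma \left( \max_{i=1\ldots K} |\langle \psi_i, \Phi c_{p,\sigma} \rangle|^2 \right) \\
&= \mathbb{E}_p \mathbb{E}_\sigma \left( \max_{i=1\ldots K} |\alpha_i \langle \phi_i, \Phi c_{p,\sigma} \rangle + \omega_i \langle z_i, \Phi c_{p,\sigma} \rangle|^2 \right).
\end{aligned} \tag{10}$$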
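The parametrisation of a perturbed atom as $\alpha\phi + \omega z$ with $\alpha = 1 - \varepsilon^2/2$ and $\omega = (\varepsilon^2 - \varepsilon^4/4)^{1/2}$ keeps the atom on the unit sphere at distance exactly $\varepsilon$ from $\phi$. A minimal numerical check, with an arbitrary atom, direction and $\varepsilon$ chosen for illustration (`perturb_atom` is a hypothetical helper name):

```python
import math

def perturb_atom(phi, z, eps):
    """Return psi = alpha * phi + omega * z for unit phi and unit z orthogonal to phi."""
    alpha = 1 - eps ** 2 / 2
    omega = math.sqrt(eps ** 2 - eps ** 4 / 4)
    return [alpha * p + omega * q for p, q in zip(phi, z)]

phi = [1.0, 0.0, 0.0]
z = [0.0, 0.6, 0.8]  # unit vector orthogonal to phi
eps = 0.2

psi = perturb_atom(phi, z, eps)
norm_psi = math.sqrt(sum(v * v for v in psi))
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(psi, phi)))
```

Both properties follow from $\alpha^2 + \omega^2 = 1$ and $(1 - \alpha)^2 + \omega^2 = \varepsilon^2$.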



2. More detailed computations of the expectation can be found in Appendix A.1.
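The main idea for the proof is that for small perturbations and most sign patterns $\sigma$ the maximal
inner product is still attained by $i$ such that $p(i) = 1$. For $p$ fixed and $i_p = p^{-1}(1)$ we now have

$$\begin{aligned}
|\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle| &= |\alpha_{i_p} \langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle + \omega_{i_p} \langle z_{i_p}, \Phi c_{p,\sigma} \rangle| \\
&= \Big| \alpha_{i_p} \sigma_{i_p} c_1 + \alpha_{i_p} \sum_{j \neq i_p} \sigma_j c_{p(j)} \langle \phi_{i_p}, \phi_j \rangle + \omega_{i_p} \sum_{j \neq i_p} \sigma_j c_{p(j)} \langle z_{i_p}, \phi_j \rangle \Big| \\
&\geq \alpha_{i_p} c_1 - \alpha_{i_p} \mu \|c\|_1 - \omega_{i_p} \Big| \sum_{j \neq i_p} \sigma_j c_{p(j)} \langle z_{i_p}, \phi_j \rangle \Big|.
\end{aligned}$$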

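Using Hoeffding's inequality we can estimate the typical size of the sum in the last expression,

$$ P(|\langle z_{i_p}, \Phi c_{p,\sigma} \rangle| > t) = P\Big(\Big| \sum_{j \neq i_p} \sigma_j c_{p(j)} \langle z_{i_p}, \phi_j \rangle \Big| > t\Big) \leq 2 \exp\left(-\frac{t^2}{2 \sum_{j \neq i_p} c_{p(j)}^2 |\langle z_{i_p}, \phi_j \rangle|^2}\right) \leq 2 \exp\left(-\frac{t^2}{2 A c_2^2}\right). $$

In case $\omega_{i_p} \neq 0$ or equivalently $\varepsilon_{i_p} \neq 0$, we set $t = s c_2 / \omega_{i_p}$ to arrive at

$$ P(\omega_{i_p} |\langle z_{i_p}, \Phi c_{p,\sigma} \rangle| > s c_2) \leq 2 \exp\left(-\frac{s^2}{2 A \omega_{i_p}^2}\right) \leq 2 \exp\left(-\frac{s^2}{2 A \varepsilon_{i_p}^2}\right), $$

where we have used that $\omega_{i_p}^2 = \varepsilon_{i_p}^2 - \varepsilon_{i_p}^4/4 \leq \varepsilon_{i_p}^2$.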
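The form of Hoeffding's inequality used here, $P(|\sum_j \sigma_j w_j| > t) \leq 2\exp(-t^2 / (2\sum_j w_j^2))$ for independent uniform signs $\sigma_j$, can be compared with the exact tail probability on a small example. The weights and thresholds below are arbitrary illustrative values, and the function names are hypothetical.

```python
import itertools
import math

def rademacher_tail(weights, t):
    """Exact P(|sum_j sigma_j * w_j| > t) for independent uniform signs sigma_j."""
    hits = 0
    for signs in itertools.product((1.0, -1.0), repeat=len(weights)):
        if abs(sum(s * w for s, w in zip(signs, weights))) > t:
            hits += 1
    return hits / 2 ** len(weights)

def hoeffding_bound(weights, t):
    """Upper bound 2 * exp(-t^2 / (2 * sum_j w_j^2)) on the same tail probability."""
    return 2 * math.exp(-t ** 2 / (2 * sum(w * w for w in weights)))

weights = [0.5, 0.4, 0.3, 0.2, 0.1]
checks = [(t, rademacher_tail(weights, t), hoeffding_bound(weights, t))
          for t in (0.5, 1.0, 1.4)]
```

In the proof the weights play the role of $c_{p(j)} \langle z_{i_p}, \phi_j \rangle$, whose squares sum to at most $A c_2^2$ for a tight frame.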

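Similarly for $i \neq i_p$ we have

$$\begin{aligned}
|\langle \psi_i, \Phi c_{p,\sigma} \rangle| &= \Big| \alpha_i \sigma_i c_{p(i)} + \alpha_i \sum_{j \neq i} \sigma_j c_{p(j)} \langle \phi_i, \phi_j \rangle + \omega_i \sum_{j \neq i} \sigma_j c_{p(j)} \langle z_i, \phi_j \rangle \Big| \\
&\leq \alpha_i c_2 + \alpha_i \mu \|c\|_1 + \omega_i \Big| \sum_{j \neq i} \sigma_j c_{p(j)} \langle z_i, \phi_j \rangle \Big|,
\end{aligned}$$

and, by Hoeffding's inequality,

$$ P(|\langle z_i, \Phi c_{p,\sigma} \rangle| > t) = P\Big(\Big| \sum_{j \neq i} \sigma_j c_{p(j)} \langle z_i, \phi_j \rangle \Big| > t\Big) \leq 2 \exp\left(-\frac{t^2}{2 A c_1^2}\right). $$

Thus in case $\omega_i, \varepsilon_i \neq 0$ we get

$$ P(\omega_i |\langle z_i, \Phi c_{p,\sigma} \rangle| > s c_1) \leq 2 \exp\left(-\frac{s^2}{2 A \omega_i^2}\right) \leq 2 \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right). $$

Note that in case $\varepsilon_i = 0$ we trivially have that

$$ P(\omega_i |\langle z_i, \Phi c_{p,\sigma} \rangle| > s c_{1/2}) = 0. $$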
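Summarising these findings we see that except with probability $\eta := 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right)$,

$$\begin{aligned}
|\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle| &\geq \alpha_{i_p} c_1 - \alpha_{i_p} \mu \|c\|_1 - s c_2 \quad \text{and} \\
|\langle \psi_i, \Phi c_{p,\sigma} \rangle| &\leq \alpha_i c_{p(i)} + \alpha_i \mu \|c\|_1 + s c_1 \quad \forall i \neq i_p.
\end{aligned}$$

This means that as long as $\alpha_{i_p} c_1 - \alpha_{i_p} \mu \|c\|_1 - s c_2 > \alpha_i c_{p(i)} + \alpha_i \mu \|c\|_1 + s c_1$ for all $i \neq i_p$, which
is for instance implied by setting $s = 1 - \frac{\varepsilon^2}{2} - 2\frac{c_2 + \mu\|c\|_1}{c_2 + c_1}$, we have

$$ \|\Psi^*\Phi c_{p,\sigma}\|_\infty = \max_{i=1\ldots K} |\langle \psi_i, \Phi c_{p,\sigma} \rangle| = |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|. $$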

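We now use this result for the calculation of the expectation over $\sigma$ in (10). For any permutation
$p$ we define the set,

$$ \Sigma_p := \bigcup_{i \neq i_p} \{\sigma \text{ s.t. } \omega_i |\langle z_i, \Phi c_{p,\sigma} \rangle| > s c_1\} \cup \{\sigma \text{ s.t. } \omega_{i_p} |\langle z_{i_p}, \Phi c_{p,\sigma} \rangle| > s c_2\}. $$

We then have

$$ \mathbb{E}_\sigma \left( \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2 \right) = \sum_{\sigma \in \Sigma_p} P(\sigma) \cdot \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2 + \sum_{\sigma \notin \Sigma_p} P(\sigma) \cdot \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2. $$

The sum over $\Sigma_p$ can be bounded as,

$$ \sum_{\sigma \in \Sigma_p} P(\sigma) \cdot \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2 \leq P(\Sigma_p) \cdot \max_\sigma \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2 \leq \eta \cdot A, $$

while for the complementary sum we get,

$$ \sum_{\sigma \notin \Sigma_p} P(\sigma) \cdot \|\Psi^*\Phi c_{p,\sigma}\|_\infty^2 \leq \sum_{\sigma} P(\sigma) |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 = \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right). $$

Re-substituting these estimates into (10) we get

$$\begin{aligned}
\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 &\leq \mathbb{E}_p \left( A\eta + \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) \right) \\
&= A\eta + \frac{c_1^2}{K} \sum_{i=1}^K |\langle \psi_i, \phi_i \rangle|^2 + \frac{1 - c_1^2}{K - 1} \Big( A - \frac{1}{K} \sum_{i=1}^K |\langle \psi_i, \phi_i \rangle|^2 \Big).
\end{aligned}$$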

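Again more detailed calculations can be found in Appendix A.1. Recalling the definition
$\eta = 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right)$ with $s = 1 - \frac{\varepsilon^2}{2} - 2\frac{c_2 + \mu\|c\|_1}{c_2 + c_1}$, and that $|\langle \psi_i, \phi_i \rangle| = \alpha_i = 1 - \varepsilon_i^2/2$ leads us
to,

$$\begin{aligned}
\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 &\leq 2A \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right) + \frac{c_1^2}{K} \sum_{i=1}^K (1 - \varepsilon_i^2/2)^2 + \frac{1 - c_1^2}{K - 1} \Big( A - \frac{1}{K} \sum_{i=1}^K (1 - \varepsilon_i^2/2)^2 \Big) \\
&= c_1^2 + \frac{1 - c_1^2}{K - 1}(A - 1) + \sum_{i : \varepsilon_i \neq 0} 2A \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right) - \frac{c_1^2}{K} \sum_{i=1}^K (\varepsilon_i^2 - \varepsilon_i^4/4) + \frac{1 - c_1^2}{K(K - 1)} \sum_{i=1}^K (\varepsilon_i^2 - \varepsilon_i^4/4) \\
&= \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 + \frac{1}{K} \sum_{i : \varepsilon_i \neq 0} \left( 2AK \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right) - c_1^2 (\varepsilon_i^2 - \varepsilon_i^4/4) + \frac{1 - c_1^2}{K - 1} (\varepsilon_i^2 - \varepsilon_i^4/4) \right).
\end{aligned}$$

Thus to prove that $\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 < \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2$, for all $\varepsilon$-perturbations $\Psi$, it suffices to show that
for all $0 < \varepsilon_i \leq \varepsilon$ we have

$$ 2AK \exp\left(-\frac{s^2}{2 A \varepsilon_i^2}\right) - c_1^2 (\varepsilon_i^2 - \varepsilon_i^4/4) + \frac{1 - c_1^2}{K - 1} (\varepsilon_i^2 - \varepsilon_i^4/4) < 0. $$

Since both $e^{-c/\varepsilon^2}$ and $\varepsilon^4$ tend much faster to zero than $\varepsilon^2$ as $\varepsilon$ goes to zero, this condition will be
satisfied as soon as $\varepsilon$ is small enough. Using some trickery that can be found in Appendix A.2
we can show that indeed all is fine if $\varepsilon < 1/5$ and

$$ \varepsilon \leq \frac{\left(1 - 2\frac{c_2 + \mu\|c\|_1}{c_2 + c_1}\right)^2}{2A \log\left(2AK / \left(c_1^2 - \frac{1 - c_1^2}{K-1}\right)\right)}. $$

□

March 28, 2013 DRAFT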

Let us comment on the result.

Remark 2.2: (i) First one may question why we chose the complicated approach above
instead of doing a first order analysis using the tangent space to the constraint manifold $\mathcal{D}$,
as in [17]. The answer is simple, it fails. As can be seen during the proof, the first order terms
$O(\varepsilon)$ are zero, requiring us to keep track also of the second order terms $O(\varepsilon^2)$.
(ii) Next note that in some sense Theorem 2.1 is sharp. Assume that $\Phi$ is an orthonormal basis
(ONB) then $\mu = 0$ and the condition to be a local maximum reduces to $c_1 > c_2$. However from
Example 2.1 we see that if $c_1 = c_2$ we can again construct an ascent direction and so $\Phi$ is not a
local maximum.

(iii) Similarly the condition that $\Phi$ is a tight frame is almost necessary in the non-trivial case
where $|c_1| < 1$.

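Assume that $\Phi$ is not tight, ie. $A\|v\|_2^2 \leq \sum_i |\langle v, \phi_i \rangle|^2 \leq B\|v\|_2^2$, with $A < B$. Going through the
proof of Theorem 2.1 we see that using the same arguments, we again have

$$ \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 = \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) \quad \text{and} \quad \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 \geq \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) $$

and by replacing $A$ with $B$ where relevant get the new upper bound,

$$ \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 \leq \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) + 2B\eta_B $$

for

$$ \eta_B = \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{\left(1 - \frac{\varepsilon^2}{2} - 2\frac{c_2 + \mu\|c\|_1}{c_2 + c_1}\right)^2}{2 B \varepsilon_i^2}\right). \tag{11} $$

Since $\eta_B$ is still of order $o(\varepsilon^2)$ to prove that $\Phi$ is a local maximum it suffices to show that
up to second order $\mathbb{E}_p \mathbb{E}_\sigma (|\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle|^2) > \mathbb{E}_p \mathbb{E}_\sigma (|\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2)$. Conversely if we can find
perturbation directions $z_i$ such that the reversed inequality holds, $\Phi$ is not a local maximum.
Using the explicit expressions for the expectations from the appendix, we get

$$\begin{aligned}
&\mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) - \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma} \rangle|^2 \right) \\
&\quad = c_1^2 + \frac{1 - c_1^2}{K(K-1)} \left( \|\Phi^*\Phi\|_F^2 - K \right) - \frac{c_1^2}{K} \sum_i \alpha_i^2 - \frac{1 - c_1^2}{K(K-1)} \Big( \|\Psi^*\Phi\|_F^2 - \sum_i |\langle \psi_i, \phi_i \rangle|^2 \Big) \\
&\quad = \Big( c_1^2 - \frac{1 - c_1^2}{K-1} \Big) \frac{1}{K} \sum_i (1 - \alpha_i^2) + \frac{1 - c_1^2}{K(K-1)} \Big( \sum_i \omega_i^2 \left( \|\Phi^*\phi_i\|_2^2 - \|\Phi^* z_i\|_2^2 \right) - 2 \sum_i \alpha_i \omega_i \langle \Phi\Phi^*\phi_i, z_i \rangle \Big).
\end{aligned}$$

Recalling that $\alpha_i = 1 - \varepsilon_i^2/2$ and $\omega_i = (\varepsilon_i^2 - \varepsilon_i^4/4)^{1/2}$, we see that all terms in the above expression
are of the order $O(\varepsilon^2)$ except for the last $\sum_i \alpha_i \omega_i \langle \Phi\Phi^*\phi_i, z_i \rangle$ which is of order $\varepsilon$. Now assume that
there exists an atom $\phi_{i_0}$ and an orthogonal perturbation direction $z$, such that $\langle \Phi\Phi^*\phi_{i_0}, z \rangle \neq 0$,
then for $\Psi$ with $\psi_{i_0} = \alpha_{i_0} \phi_{i_0} + \sigma\omega z$, where $\sigma = \mathrm{sign}(\langle \Phi\Phi^*\phi_{i_0}, z \rangle)$, and $\psi_i = \phi_i$ for all $i \neq i_0$, the
expression above will be smaller than zero as soon as $\varepsilon$ is small enough, meaning that $\Phi$ is not
a local maximum.

Consequently a necessary condition for $\Phi$ to be a local maximum is that $\langle \Phi\Phi^*\phi_i, z \rangle = 0$ whenever
$\langle \phi_i, z \rangle = 0$, which is equivalent to every atom being an eigenvector of the frame operator of
the dictionary, ie. $\Phi\Phi^*\phi_i = \lambda_i \phi_i$, $\forall i$. While this condition is certainly fulfilled when $\Phi$ is a tight
frame (corresponding to $\lambda_i = A$), it is sufficient for $\Phi$ to be a collection of $m$ tight frames for $m$
orthogonal subspaces of $\mathbb{R}^d$ (corresponding to the case $\Phi = (\Phi_{\lambda_1}, \ldots, \Phi_{\lambda_m})$ with $\Phi\Phi^*\Phi_\lambda = \lambda\Phi_\lambda$).
Going through the same analysis as in the proof of Theorem 2.1 we see that in this second case
$\Phi$ is again a local maximum under the additional condition that $c_1^2 > \frac{B - A + 1}{B - A + K}$, where $A = \min_i \lambda_i$
and $B = \max_i \lambda_i$. However, for simplicity we will henceforth restrict our analysis to the situation
where $\Phi$ is a tight frame.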

2.4 A continuous model of decaying coefficients 

After proving a recovery result for the simple coefficient model of the last section we would 
like to extend it to a wider range of coefficient distributions, especially continuous ones. To see 
which distributions are good candidates we will point out the properties of the simple model 
we needed for the proof to succeed. 

• To see for which index the inner products $\langle \phi_i, \Phi c_{p,\sigma} \rangle$ were maximal, cp. (8), (9), we used the
decay-condition $c_1 > c_2 + 2\mu\|c\|_1$.

• For the calculation of $\mathbb{E}_{p,\sigma} |\langle \phi_{i_p}, \Phi c_{p,\sigma} \rangle|^2$ we used that the largest coefficient was equally
likely to have any index, which was ensured by the fact that each permutation of the base
sequence $c$ was equally likely.

• Finally to bound the size of the inner products $\langle z_i, \Phi c_{p,\sigma} \rangle$ and thus the size of $\langle \psi_i, \Phi c_{p,\sigma} \rangle$ with
high probability we needed the equal probability of all sign patterns.

Using these three observations we can now make the following definitions 

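Definition 2.1: A probability measure $\nu$ on the unit sphere $S^{d-1} \subset \mathbb{R}^d$ is called symmetric if
for all measurable sets $\mathcal{X} \subseteq S^{d-1}$, for all sign sequences $\sigma \in \{-1, 1\}^d$ and all permutations $p$ we
have

$$ \nu(\sigma\mathcal{X}) = \nu(\mathcal{X}), \quad \text{where } \sigma\mathcal{X} := \{(\sigma_1 x_1, \ldots, \sigma_d x_d) : x \in \mathcal{X}\} \tag{12} $$

$$ \nu(p(\mathcal{X})) = \nu(\mathcal{X}), \quad \text{where } p(\mathcal{X}) := \{(x_{p(1)}, \ldots, x_{p(d)}) : x \in \mathcal{X}\} \tag{13} $$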

Definition 2.2: A probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ is called $(\beta, \mu)$-decaying
if there exists a $\beta < 1/2$ such that for $c_1(x) \geq c_2(x) \geq \ldots \geq c_K(x) \geq 0$ a non increasing
rearrangement of the absolute values of the components of $x$ we have,

$$ \nu\left( \frac{c_2(x) + \mu\|c(x)\|_1}{c_2(x) + c_1(x)} \leq \beta \right) = 1. \tag{14} $$

For the case $\mu = 0$ it will also be useful to define the following notion. A probability distribution
$\nu$ on the unit sphere $S^{d-1} \subset \mathbb{R}^d$ is called $f$-decaying if there exists a function $f$ such that

$$ \exp\left(-\frac{f(\varepsilon)^2}{8\varepsilon^2}\right) = o(\varepsilon^2) \tag{15} $$

and

$$ \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon) \right) = o(\varepsilon^2). \tag{16} $$

Note that $(\beta, 0)$-decaying is a special case of $f$-decaying, ie. $f(\varepsilon)$ can be chosen constant $\beta$. To
illustrate both concepts we give simple examples for $(\beta, \mu)$- and $f$-decaying distributions on $S^1$.


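Example 2.3:
• Let $\nu$ be the symmetric distribution on $S^1$ defined by $c_2(x)$ being uniformly
distributed on $[0, \frac{1}{\sqrt{2}} - \theta]$ for $\theta > 0$ (and accordingly $c_1(x) = \sqrt{1 - c_2^2(x)}$), then $\nu$ is $(\beta, \mu)$-decaying
for all sufficiently small $\mu$, namely whenever $\frac{c_2(x)}{c_1(x) + c_2(x)} + \mu < \frac{1}{2}$ at the end point $c_2(x) = \frac{1}{\sqrt{2}} - \theta$.

• Let $\nu$ be the symmetric distribution on $S^1$ defined by $c_2(x)$ being distributed on $[0, \frac{1}{\sqrt{2}}]$ with
density $20\sqrt{2}(\frac{1}{\sqrt{2}} - x)^4$, then $\nu$ is $f$-decaying for e.g. $f(\varepsilon) = \sqrt{\varepsilon}$.

• Let $\nu$ be the symmetric distribution on $S^1$ defined by $c_2(x)$ being distributed on $[0, \frac{1}{\sqrt{2}}]$ with
density $4(\frac{1}{\sqrt{2}} - x)$, then $\nu$ is not $f$-decaying.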

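While the decay properties for the first two examples follow from basic integrations, we
will elaborate shortly on this last example. For any function $f$ with $0 \leq f(\varepsilon) \leq 1$ we have the lower bound,

$$ \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon) \right) = 4 \int_{\frac{1 - f(\varepsilon)}{\sqrt{2 - 2f(\varepsilon) + f(\varepsilon)^2}}}^{\frac{1}{\sqrt{2}}} \left( \frac{1}{\sqrt{2}} - x \right) dx \geq \frac{f(\varepsilon)^2}{4}. $$

This means that we need $f(\varepsilon)^2 = o(\varepsilon^2)$ at the same time as $\exp\left(-\frac{f(\varepsilon)^2}{8\varepsilon^2}\right) = o(\varepsilon^2)$, which is
impossible, so $\nu$ cannot be $f$-decaying.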
An important group of probability distributions expected to be (/3, /i)-decaying are the distri- 
butions introduced in to model strongly compressible, ie. nearly sparse vectors. 

With these examples of suitable probability distributions in mind we can now turn to proving
a continuous version of Theorem 2.1.

Theorem 2.2: (a) Let $\Phi$ be a unit norm tight frame with frame constant $A = K/d$ and coherence
$\mu$. If $x$ is drawn from a symmetric $(\beta, \mu)$-decaying probability distribution $\nu$ on the unit sphere
$S^{K-1}$, then there is a local maximum of (6) at $\Phi$ and we have the following quantitative estimate
for the basin of attraction around $\Phi$. Define $\bar{c}_1^2 := \mathbb{E}_x \|x\|_\infty^2$. For all perturbations $\Psi = (\psi_1 \ldots \psi_K)$
of $\Phi = (\phi_1 \ldots \phi_K)$ with $0 < \max_i \|\psi_i - \phi_i\|_2 \leq \varepsilon$ we have $\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 < \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2$ as soon as
$\varepsilon < 1/5$ and

$$ \varepsilon \leq \frac{(1 - 2\beta)^2}{2A \log\left(2AK / \left(\bar{c}_1^2 - \frac{1 - \bar{c}_1^2}{K-1}\right)\right)}. $$

(b) If $\Phi$ is an orthonormal basis, there is a local maximum of (6) at $\Phi$ whenever $x$ is drawn
from a symmetric $f$-decaying probability distribution $\nu$ on the unit sphere $S^{K-1}$.

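Proof: (a) Let $c$ denote the mapping that assigns to each $x \in S^{K-1}$ the non increasing
rearrangement of the absolute values of its components, i.e. $c_i(x) = |x_{p(i)}|$ for a permutation $p$ such
that $c_1(x) \geq c_2(x) \geq \ldots \geq c_K(x) \geq 0$. Then the mapping $c$ together with the probability measure
$\nu$ on $S^{K-1}$ induces a pull-back probability measure $\nu_c$ on $c(S^{K-1})$, by $\nu_c(\Omega) := \nu(c^{-1}(\Omega))$ for any
measurable set $\Omega \subseteq c(S^{K-1})$. With the help of this new measure we can rewrite the expectations
we need to calculate as,

$$ \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 = \int_x \|\Phi^*\Phi x\|_\infty^2 \, d\nu(x) = \int_{c(x)} \mathbb{E}_p \mathbb{E}_\sigma \|\Phi^*\Phi c_{p,\sigma}(x)\|_\infty^2 \, d\nu_c(x). $$

The expectation inside the integral should seem familiar. Indeed we have calculated it already in
the proof of Theorem 2.1 for $c(x)$ a fixed decaying sequence satisfying $c_1(x) > c_2(x) + 2\mu\|c(x)\|_1$.
This property is satisfied almost surely since $\nu$ is $(\beta, \mu)$-decaying and so we have,

$$\begin{aligned}
\mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 &= \int_{c(x)} \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \phi_{i_p}, \Phi c_{p,\sigma}(x) \rangle|^2 \right) d\nu_c(x) \\
&= \int_{c(x)} c_1^2(x) + \frac{1 - c_1^2(x)}{K - 1}(A - 1) \, d\nu_c(x).
\end{aligned}$$

Note that the integral term $\int_{c(x)} c_1(x)^2 \, d\nu_c(x)$ is simply $\mathbb{E}_x \|x\|_\infty^2 = \bar{c}_1^2$, leading to the concise
expression for the expectation,

$$ \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 = \bar{c}_1^2 + \frac{1 - \bar{c}_1^2}{K - 1}(A - 1). $$

For the expectation of a perturbed dictionary $\Psi$ we get in analogy

$$ \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 \leq \int_{c(x)} A\eta(x) + \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma}(x) \rangle|^2 \right) d\nu_c(x), \tag{17} $$

where

$$ \eta(x) := 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{\left(1 - \frac{\varepsilon^2}{2} - 2\frac{c_2(x) + \mu\|c(x)\|_1}{c_2(x) + c_1(x)}\right)^2}{2 A \varepsilon_i^2}\right). $$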
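Define,

$$ \eta_\beta := 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{\left(1 - \frac{\varepsilon^2}{2} - 2\beta\right)^2}{2 A \varepsilon_i^2}\right), $$

then since $\nu$ is $(\beta, \mu)$-decaying $\eta(x) \leq \eta_\beta$ almost surely. Continuing the estimate in (17) we get

$$ \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 \leq A\eta_\beta + \frac{\bar{c}_1^2}{K} \sum_{i=1}^K |\langle \psi_i, \phi_i \rangle|^2 + \frac{1 - \bar{c}_1^2}{K - 1} \Big( A - \frac{1}{K} \sum_{i=1}^K |\langle \psi_i, \phi_i \rangle|^2 \Big). $$

Following the same argument as in the proof of Theorem 2.1 we see that $\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 <
\mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2$ once we have $\varepsilon < 1/5$ and

$$ \varepsilon \leq \frac{(1 - 2\beta)^2}{2A \log\left(2AK / \left(\bar{c}_1^2 - \frac{1 - \bar{c}_1^2}{K-1}\right)\right)}. $$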



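(b) If $\Phi$ is actually an orthonormal basis, ie. $A = 1$, we simply have $\mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 = \mathbb{E}_x \|x\|_\infty^2 = \bar{c}_1^2$.
However if $\nu$ is only $f$-decaying we need to be more careful in our estimation of $\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2$.
Let $\iota$ denote the index for which $\varepsilon_i$ is maximal. We have,

$$\begin{aligned}
\mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 &= \int_{x : \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon_\iota)} \|\Psi^*\Phi x\|_\infty^2 \, d\nu(x) + \int_{x : \frac{c_2(x)}{c_1(x)} < 1 - f(\varepsilon_\iota)} \|\Psi^*\Phi x\|_\infty^2 \, d\nu(x) \\
&\leq \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon_\iota) \right) + \int_{c(x) : \frac{c_2(x)}{c_1(x)} < 1 - f(\varepsilon_\iota)} \mathbb{E}_p \mathbb{E}_\sigma \|\Psi^*\Phi c_{p,\sigma}(x)\|_\infty^2 \, d\nu_c(x).
\end{aligned}$$

For convenience we write $\Omega := \{c(x) : \frac{c_2(x)}{c_1(x)} < 1 - f(\varepsilon_\iota)\}$, leading to

$$ \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 \leq \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon_\iota) \right) + \int_\Omega \eta_\iota(x) + \mathbb{E}_p \mathbb{E}_\sigma \left( |\langle \psi_{i_p}, \Phi c_{p,\sigma}(x) \rangle|^2 \right) d\nu_c(x), $$

where

$$ \eta_\iota(x) := 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{\left(1 - \frac{\varepsilon_\iota^2}{2} - \frac{2 c_2(x)}{c_2(x) + c_1(x)}\right)^2}{2 \varepsilon_i^2}\right). $$

As long as $c(x) \in \Omega$ we have $\eta_\iota(x) \leq 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{(f(\varepsilon_\iota) - \varepsilon_\iota^2)^2}{8 \varepsilon_i^2}\right)$, so we can further bound

$$ \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 \leq \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon_\iota) \right) + 2 \sum_{i : \varepsilon_i \neq 0} \exp\left(-\frac{(f(\varepsilon_\iota) - \varepsilon_\iota^2)^2}{8 \varepsilon_i^2}\right) + \frac{\bar{c}_1^2}{d} \sum_i |\langle \psi_i, \phi_i \rangle|^2 + \frac{1 - \bar{c}_1^2}{d - 1} \Big( 1 - \frac{1}{d} \sum_i |\langle \psi_i, \phi_i \rangle|^2 \Big), $$

leading to the following estimate,

$$ \mathbb{E}_x \|\Psi^*\Phi x\|_\infty^2 - \mathbb{E}_x \|\Phi^*\Phi x\|_\infty^2 \leq \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon_\iota) \right) + 2d \exp\left(-\frac{(f(\varepsilon_\iota) - \varepsilon_\iota^2)^2}{8 \varepsilon_\iota^2}\right) - \Big( \bar{c}_1^2 - \frac{1 - \bar{c}_1^2}{d - 1} \Big) \frac{1}{d} \sum_i (\varepsilon_i^2 - \varepsilon_i^4/4). $$

The terms in $\varepsilon_i$ in the above estimates are clearly smaller than zero for $\varepsilon_i \leq \varepsilon_\iota < 1$ so to finish
the proof all that remains to be shown is that

$$ \nu\left( \frac{c_2(x)}{c_1(x)} \geq 1 - f(\varepsilon_\iota) \right) + 2d \exp\left(-\frac{(f(\varepsilon_\iota) - \varepsilon_\iota^2)^2}{8 \varepsilon_\iota^2}\right) - \Big( \bar{c}_1^2 - \frac{1 - \bar{c}_1^2}{d - 1} \Big) \frac{1}{d} (\varepsilon_\iota^2 - \varepsilon_\iota^4/4) < 0. $$

This, however, is guaranteed by $\nu$ being $f$-decaying, which ensures that the first two terms in the
above expression are of order $o(\varepsilon_\iota^2)$ and therefore smaller than the third term of order $O(\varepsilon_\iota^2)$, as
soon as $\varepsilon_\iota$ is close enough to zero. □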
Remark 2.4: It would of course be possible to extend the notion of $f$-decaying to $(f, \mu)$-decaying.
However, for $\mu > 0$ the condition $c_1 > c_2 + 2\mu\|c\|_1$ is only sufficient but not necessary
for $\Phi$ to be a local maximum. It is merely the result of using the simple but crude bounds in
(8) and (9) and could for instance be replaced by $(1 + \mu)c_1 > (1 - \mu)c_2 + 2\mu\|c\|_1$. Thus unless we
have a sharp bound on the coefficient sequence for $|\langle \phi_i, \Phi c_{p,\sigma} \rangle|$ to take its maximum uniquely
at $i = i_p$ it is quite useless to try to approach this bound in probability.

2.5 Bounded white noise 

With the tools used to prove the two noiseless identification results in the last two subsections 
it is also possible to analyse the case of (very small) bounded white noise. 

Theorem 2.3: Let $\Phi$ be a unit norm tight frame with frame constant $A = K/d$ and coherence
$\mu$. Assume that the signals $y$ are generated from the following model

$$ y = \Phi x + r, \tag{18} $$

where $r$ is a bounded random white noise vector, ie. there exist two constants $\rho, \rho_{\max}$ such that
$\|r\|_2 \leq \rho_{\max}$ almost surely, $\mathbb{E}(r) = 0$ and $\mathbb{E}(rr^*) = \rho^2 I$. If $x$ is drawn from a symmetric decaying
probability distribution $\nu$ on the unit sphere $S^{K-1}$ with $\mathbb{E}_x \|x\|_\infty^2 = \bar{c}_1^2$ and the maximal size of
the noise is small compared to the size and decay of the coefficients $c_1, c_2$, meaning there exists
$\beta < 1/2$, such that

$$ \nu\left( \frac{c_2(x) + \mu\|c(x)\|_1 + \rho_{\max}}{c_1(x) - c_2(x)} \leq \beta \right) = 1 \tag{19} $$
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then there is a local maximum of (|6]) at <E> and we have the following quantitative estimate for 
the basin of attraction around <£. For all perturbations \P = (ip% . . . ipx) of $ = ((pi . . . 4>k) with 
< maxj \\tpi — 4>i\\2 < e we have EyH^yH^ < E^ll^yH^ as soon as e < 1/5 and 

(1-2/3) 2 



£ < 



2 A log (2AK/{c\ 



1-5? 
K-1- 



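Proof: We just sketch the proof, since it relies on the same ideas as those of Theorem 2.1 and Theorem 2.2. Condition (19) ensures that with probability 1 $\max_i |\langle \phi_i, y\rangle| = \max_i |\langle \phi_i, \Phi x + r\rangle|$ is attained for $i = i_p$, so we have
\begin{align*}
\mathbb{E}_y\|\Phi^\star y\|_\infty^2 &= \mathbb{E}_{x,r} |\langle \phi_{i_p}, \Phi x + r\rangle|^2 \\
&= \mathbb{E}_x |\langle \phi_{i_p}, \Phi x\rangle|^2 + \mathbb{E}_r |\langle \phi_{i_p}, r\rangle|^2 = \mathbb{E}_x |\langle \phi_{i_p}, \Phi x\rangle|^2 + \rho^2.
\end{align*}
Similarly $\max_i |\langle \psi_i, y\rangle| = \max_i |\langle \psi_i, \Phi x + r\rangle|$ is attained for $i = i_p$ except with probability at most
\[ \eta := 2 \sum_i \exp\left( -\frac{(1 - \varepsilon_i^2/2 - 2\beta)^2}{2A\varepsilon_i^2} \right), \]
leading to
\begin{align*}
\mathbb{E}_y\|\Psi^\star y\|_\infty^2 &\leq A\eta + \mathbb{E}_{x,r} |\langle \psi_{i_p}, \Phi x + r\rangle|^2 \\
&= A\eta + \mathbb{E}_x |\langle \psi_{i_p}, \Phi x\rangle|^2 + \mathbb{E}_r |\langle \psi_{i_p}, r\rangle|^2 = A\eta + \mathbb{E}_x |\langle \psi_{i_p}, \Phi x\rangle|^2 + \rho^2.
\end{align*}
The result then follows from the usual arguments. □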
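The first step of the proof sketch, that under the decay-versus-noise condition the largest inner product is always attained at the atom carrying the largest coefficient, can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the original analysis: the Hadamard-Dirac dictionary in dimension 64, the two-term coefficient sequence, the noise drawn uniformly from the sphere of radius `rho_max`, and all function names are illustrative choices.

```python
import numpy as np

def hadamard_dirac(d):
    """Return the d x 2d dictionary [I, H/sqrt(d)], d a power of two (illustrative choice)."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.kron(h, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return np.hstack([np.eye(d), h / np.sqrt(d)])

def sample_signal(phi, c, rho_max, rng):
    """Draw y = Phi x + r, x a signed permutation of c, r uniform on the sphere of radius rho_max."""
    d, K = phi.shape
    perm = rng.permutation(K)
    signs = rng.choice([-1.0, 1.0], size=K)
    x = signs * c[perm]
    r = rng.standard_normal(d)
    r *= rho_max / np.linalg.norm(r)
    # also return i_p, the index of the atom carrying the largest coefficient
    return phi @ x + r, int(np.argmax(np.abs(x)))

def decay_ratio(c, mu, rho_max):
    """Left-hand side of the decay condition: (c_2 + mu*||c||_1 + rho_max) / (c_1 - c_2)."""
    cs = np.sort(np.abs(c))[::-1]
    return (cs[1] + mu * cs.sum() + rho_max) / (cs[0] - cs[1])

d = 64
phi = hadamard_dirac(d)
K = phi.shape[1]
mu = (np.abs(phi.T @ phi) - np.eye(K)).max()   # coherence, here 1/sqrt(d)

c = np.zeros(K)
c[1] = 0.05
c[0] = np.sqrt(1.0 - c[1] ** 2)                # unit norm coefficient sequence
rho_max = 0.05
beta = decay_ratio(c, mu, rho_max)

rng = np.random.default_rng(0)
trials, hits = 500, 0
for _ in range(trials):
    y, i_p = sample_signal(phi, c, rho_max, rng)
    hits += int(np.argmax(np.abs(phi.T @ y)) == i_p)

print(f"mu = {mu:.4f}, beta = {beta:.4f}, correct maxima: {hits}/{trials}")
```

With these parameters the ratio stays below 1/2, so the maximal response should be found at $i_p$ in every trial; increasing `rho_max` or `c[1]` until the ratio exceeds 1/2 removes that guarantee.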

3 Asymptotic identification results for S > 1 

In this section we extend the identification results from the last section to the case where $S > 1$, ie. we study the problem
\[ \max_{\Psi} \mathbb{E}_y\Big( \max_{|I| \leq S} \|P_I(\Psi) y\|_2^2 \Big). \tag{20} \]
We use essentially the same tools as for the 1-sparse case. However, since the problem does not reduce, the proofs become more technical; for instance we need to estimate the difference between $P_I(\Phi)$ and $P_I(\Psi)$ instead of $\phi_i$ and $\psi_i$, and need a vector version of Hoeffding's inequality to estimate the typical size of $P_I(\Psi)\Phi c_{p,\sigma}$. So to keep the presentation concise we rely heavily on the $O, o$ notation. Also the results are in a different spirit. We trade concreteness, such as explicit conditions on the coefficient sequence for $\Phi$ to be a local maximum or an estimate for the basin of attraction, for sharpness by formulating our results as tightly as the available tools permit.

We start by proving a general version of Theorem 2.1 for the simple coefficient model introduced in Section 2.3, which will again lay the groundwork for the more complicated signal models.

Theorem 3.1: Let $\Phi$ be a unit norm tight frame with frame constant $A = K/d$ and coherence $\mu$. Let $x$ be a random permutation of a positive, nonincreasing sequence $c$, where $c_1 \geq c_2 \geq c_3 \ldots \geq c_K \geq 0$ and $\|c\|_2 = 1$, provided with random $\pm$ signs, i.e. $x = c_{p,\sigma}$ with probability $P(p, \sigma) = (2^K K!)^{-1}$. Assume that the signals are generated as $y = \Phi x$. If we have
\[ \forall \sigma, p: \quad \|P_{I_p}(\Phi)\Phi c_{p,\sigma}\|_2 > \max_{|I| \leq S,\, I \neq I_p} \|P_I(\Phi)\Phi c_{p,\sigma}\|_2, \quad \text{where } I_p := p^{-1}(\{1, \ldots, S\}), \tag{21} \]
then there is a local maximum of (20) at $\Phi$.

Proof: We first calculate the expectation using the original dictionary $\Phi$. Condition (21) quite obviously (and artlessly) guarantees that the maximum is always attained for the set $I_p$, so setting $\gamma^2 := c_1^2 + \ldots + c_S^2$ we get$^3$
\[ \mathbb{E}_y\Big( \max_{|I| \leq S} \|P_I(\Phi) y\|_2^2 \Big) = \mathbb{E}_p \mathbb{E}_\sigma \big( \|P_{I_p}(\Phi)\Phi c_{p,\sigma}\|_2^2 \big) = \gamma^2 + \frac{(A-1)(1-\gamma^2)S}{K-S}. \]
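The closed form $\gamma^2 + (A-1)(1-\gamma^2)S/(K-S)$ can be compared against a brute-force average over all permutations for a small tight frame. The sketch below is illustrative only: the frame (identity plus a random orthonormal basis in dimension 3), the coefficient sequence and the helper names are assumptions made for the check. The average over the signs is taken analytically, since the cross terms have zero mean.

```python
import itertools
import numpy as np

def projector(atoms):
    """Orthogonal projection onto the span of the (linearly independent) columns of atoms."""
    q, _ = np.linalg.qr(atoms)
    return q @ q.T

def expected_energy(phi, c, S):
    """Average of sum_i c_{p(i)}^2 ||P_{I_p}(Phi) phi_i||^2 over all permutations p."""
    K = phi.shape[1]
    perms = list(itertools.permutations(range(K)))
    total = 0.0
    for p in perms:
        # atom i receives coefficient c[p[i]]; I_p collects the atoms with the S largest ones
        support = [i for i in range(K) if p[i] < S]
        P = projector(phi[:, support])
        energies = np.sum((P @ phi) ** 2, axis=0)      # ||P phi_i||_2^2 for every atom
        total += float(np.dot(c[list(p)] ** 2, energies))
    return total / len(perms)

rng = np.random.default_rng(1)
d, S = 3, 2
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
phi = np.hstack([np.eye(d), q])                        # union of two orthonormal bases
K = phi.shape[1]
A = K / d                                              # frame constant of the tight frame

c = np.array([0.8, 0.5, 0.25, 0.15, 0.1, 0.05])
c = c / np.linalg.norm(c)                              # nonincreasing, unit norm
gamma2 = np.sum(c[:S] ** 2)

closed_form = gamma2 + (A - 1) * (1 - gamma2) * S / (K - S)
brute_force = expected_energy(phi, c, S)
print(closed_form, brute_force)
```

The two numbers should agree up to floating point error, because for a tight frame $\operatorname{tr}(P_I(\Phi)\Phi\Phi^\star) = AS$ holds for every support $I$ and not only on average.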

We use the same parametrisation for all $\varepsilon$-perturbations as in the last section. Since we have to calculate with projections $P_I(\Psi)$ we also define $A_I = \operatorname{diag}(\alpha_i)_{i \in I}$ and $W_I = \operatorname{diag}(\omega_i)_{i \in I}$ to get $\Psi_I = \Phi_I A_I + Z_I W_I$.

As in the case $S = 1$ our strategy will be to show that with high probability for a fixed permutation $p$ the maximal projection is still onto the atoms indexed by $I_p$. For any index set $I$ of size $S$ we can bound the difference between the projection using the corresponding atoms in $\Psi$ or $\Phi$ using the reverse triangle inequality,
\[ \big| \|P_I(\Psi) y\|_2 - \|P_I(\Phi) y\|_2 \big| \leq \|(P_I(\Psi) - P_I(\Phi)) y\|_2. \tag{22} \]

3. For a detailed calculation see Appendix B.1.
To estimate the typical size of the right hand side in the above equation we need a vector valued version of Hoeffding's inequality. We take the following convenient if not optimal concentration inequality for Rademacher series from [21], Chapter 4.

Corollary 3.2 (of Theorem 4.7 in [23]): For a vector-valued Rademacher series $V = \sum_i \sigma_i v_i$, ie. for $\sigma_i$ independent Bernoulli variables with $P(\sigma_i = \pm 1) = 1/2$ and $v_i \in \mathbb{R}^d$, and $t > 0$ we have,
\[ P(\|V\|_2 > t) \leq 2 \exp\left( \frac{-t^2}{32 \sum_i \|v_i\|_2^2} \right). \tag{23} \]
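To get a feeling for how conservative the bound (23) is, one can compare it with empirical tail frequencies of a simulated Rademacher series. This is a small illustrative sketch; the vectors $v_i$, the dimensions and the thresholds are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_terms, dim, n_draws = 20, 5, 20000

v = rng.standard_normal((n_terms, dim))                # fixed vectors v_i, one per row
sigma = rng.choice([-1.0, 1.0], size=(n_draws, n_terms))
norms = np.linalg.norm(sigma @ v, axis=1)              # ||sum_i sigma_i v_i||_2 per draw

s2 = np.sum(v ** 2)                                    # sum_i ||v_i||_2^2
ts = np.sqrt(s2) * np.array([1.0, 2.0, 4.0, 6.0])      # thresholds in units of sqrt(s2)
empirical = np.array([(norms > t).mean() for t in ts])
bound = 2.0 * np.exp(-ts ** 2 / (32.0 * s2))

for t, e, b in zip(ts, empirical, bound):
    print(f"t = {t:7.3f}   empirical tail = {e:.4f}   bound (23) = {b:.4f}")
```

The bound only becomes smaller than one for $t$ beyond roughly $4.7\,(\sum_i \|v_i\|_2^2)^{1/2}$, which matches the description of the inequality as convenient rather than optimal.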

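Applied to $v_i = c_{p(i)} (P_I(\Psi) - P_I(\Phi))\phi_i$ this leads to the following estimate,
\begin{align*}
P\big( \|(P_I(\Psi) - P_I(\Phi))\Phi c_{p,\sigma}\|_2 > t \big)
&\leq 2\exp\left( \frac{-t^2}{32 \sum_i c_{p(i)}^2 \|(P_I(\Psi) - P_I(\Phi))\phi_i\|_2^2} \right) \\
&\leq 2\exp\left( \frac{-t^2}{32 \sum_i c_{p(i)}^2 \|P_I(\Psi) - P_I(\Phi)\|_{2,2}^2} \right) \\
&\leq 2\exp\left( \frac{-t^2}{32 \|P_I(\Psi) - P_I(\Phi)\|_F^2} \right),
\end{align*}
whenever $P_I(\Psi) \neq P_I(\Phi)$ (otherwise we trivially have $P\big( \|(P_I(\Psi) - P_I(\Phi))\Phi c_{p,\sigma}\|_2 > t \big) = 0$).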



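From Appendix B.2 we know that $\|P_I(\Psi) - P_I(\Phi)\|_F^2 = O\big(\|Q_I(\Phi) Z_I W_I A_I^{-1}\|_F^2\big)$, where $Q_I(\Phi)$ is the projection onto the orthogonal complement of the span of $\Phi_I$, so we finally get,
\[ P\big( \|(P_I(\Psi) - P_I(\Phi))\Phi c_{p,\sigma}\|_2 > t \big) \leq 2\exp\left( \frac{-t^2}{O\big(\|Q_I(\Phi) Z_I W_I A_I^{-1}\|_F^2\big)} \right). \]
Define $\kappa := \frac{1}{2} \min_{p,\sigma} \big( \|P_{I_p}(\Phi)\Phi c_{p,\sigma}\|_2 - \max_{|I| \leq S,\, I \neq I_p} \|P_I(\Phi)\Phi c_{p,\sigma}\|_2 \big)$, then by Condition (21) we have $\kappa > 0$ and
\begin{align*}
\|P_{I_p}(\Psi)\Phi c_{p,\sigma}\|_2 &\geq \|P_{I_p}(\Phi)\Phi c_{p,\sigma}\|_2 - \kappa \\
&\geq \max_{I: I \neq I_p} \|P_I(\Phi)\Phi c_{p,\sigma}\|_2 + \kappa \geq \max_{I: I \neq I_p} \|P_I(\Psi)\Phi c_{p,\sigma}\|_2,
\end{align*}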

with probability at least Vs = 2 Elq^^z^aj^o ™V ^{wq^z^aj^I) )- To calculate the 
expectation E CT (max|j|<,g ||Pj-($)c Pi0 -|| 2 ) we again define a set S p , 

S P= U H(^(*)- P /W) $C P-ll2 
h\I\=S 

Splitting the expectation in a sum over the sign sequences contained in £ p and its complement, 
we can estimate,

E a ( max||Pr(*)cp jCr ||| ) = V max ||P/(tf)$Cp, CT ||i + V max ||P/(^)$Cp )<7 ||| 

<P(£ p )max||$Cp, CT ||l + £ ll^/ P (*)^lll 



cr£E p 

(TgS„ 



<r ?s ^ + E CT (||P /p (*)$c Pi(T ||l) 



Using the expression for EpE^ (ll-Prp^^c^o-llg) derived in Appendix B.l we get the following 
bound for the expectation of the maximal projection using a perturbed dictionary, 

Finally we are ready to compare the above expression to the corresponding one for the original 
dictionary We abbreviate A = £ - ^§ and 5/ = Z/PFMj 1 . Employing ||P/(*)$j||| = ||*/|||- 



WQj^BjW^ + 0(||Q/(*)fl / ||f,||.Bi|| i? from Appendix B.2 we get, 



E, LaxllP^H^ -E, Lax||Pr($)y||l 



<^ £ ^{ 32 \\P m -P im i ) +A @" 1 E (11^(^/111-11^111) (24) 



^ E 2Aex P ( nnm jL.^ ) " M5)" 1 (IIQ/W^IIf + 0(||Q/(<I>)£/|||||B/||i,)) 



7:. 



O (||Q/(*)B/[||) 



Using the usual arguments we see that for e / the above expression is strictly smaller than 
zero as soon as e and consequently < ||-B/|| F < Se 2 /{\ — e 2 ) are small enough, 

showing that there is a local maximum of |20| at $. □ 
Remark 3.1: To make the above theorem more applicable it would be nice to have a concrete condition in terms of the coherence of the dictionary rather than the abstract condition in (21). Indeed it can be shown, see [27], Appendix C, that Condition (21) is implied by the following decay of the coefficients

1 — S/j, 4it \ - 

cs> t^ cs+1 + t^ E u (25) 

r ^ i>S+l 

for $S\mu < 1/2$. Up to a factor this corresponds to the decay condition for the case $S = 1$.

We will now state a version of Theorem 3.1 for a continuous coefficient model, analogous to Theorem 2.2 a). However we will omit the proof since no new insights can be gained from it.



March 28, 2013 



DRAFT 






Theorem 3.3: Let $\Phi$ be a unit norm tight frame with frame constant $K/d$ and coherence $\mu$. Let $x$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere and assume that the signals are generated as $y = \Phi x$. If there exists $\kappa > 0$ such that for $c(x)$ a non-increasing rearrangement of the absolute values of $x$ and $I_p := p^{-1}(\{1, \ldots, S\})$ we have,
\[ \nu\left( \min_{p,\sigma} \Big( \|P_{I_p}(\Phi)\Phi c_{p,\sigma}(x)\|_2 - \max_{|I| \leq S,\, I \neq I_p} \|P_I(\Phi)\Phi c_{p,\sigma}(x)\|_2 \Big) \geq 2\kappa \right) = 1, \tag{26} \]
then there is a local maximum of (20) at $\Phi$.

Proof: Apply the technique used to prove Theorem 2.2 to the results derived in the proof of Theorem 3.1. □
Remark 3.2: (a) Again the abstract condition in (26) can be replaced by a decay-condition on the coefficients involving the coherence, ie. analogous to (25) we have for $S\mu < 1/2$,



E m + 2k ) 

>5+i / 



" (<* > i^m^' + r^h ^ |Cil + 2K 1 = L (27) 



(b) Note that with the available tools it is also possible to extend Theorem 3.3 to signal models with coefficient distributions approaching the limit in (26), ie. $\kappa = 0$, or including bounded white noise. However, to keep the presentation concise, we leave both the formulation and the proof of generalisations corresponding to Theorems 2.2 b) and 2.3 to the interested reader, and instead turn to the analysis of the practically relevant case when we have a finite sample size.

4 Finite sample size results for S = l 

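Finally we make the step from the asymptotic identification results derived in the last two sections to identification results for a finite number of training samples. Again we start with the simple case when $S = 1$, ie. we consider the maximisation problem,
\[ \max_{\Psi} \frac{1}{N} \sum_{n=1}^N \|\Psi^\star y_n\|_\infty^2. \tag{28} \]
The main idea is that whenever $\Psi$ is near to $\Phi$ we have
\[ \frac{1}{N} \sum_{n=1}^N \|\Psi^\star y_n\|_\infty^2 \approx \mathbb{E}\big(\|\Psi^\star y\|_\infty^2\big) < \mathbb{E}\big(\|\Phi^\star y\|_\infty^2\big) \approx \frac{1}{N} \sum_{n=1}^N \|\Phi^\star y_n\|_\infty^2. \]
Concretising the sharpness of $\approx$ quantitatively and making sure that it is valid for all possible $\varepsilon$-perturbations at the same time, leads to the following theorem.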

Theorem 4.1: Let $\Phi$ be a unit norm tight frame with frame constant $A = K/d$ and coherence $\mu$. Assume that the signals $y_n$ are generated as $y_n = \Phi x_n + r_n$, where $r_n$ is a bounded random white noise vector, ie. there exist two constants $\rho, \rho_{\max}$ such that $\|r_n\|_2 \leq \rho_{\max}$ almost surely, $\mathbb{E}(r_n) = 0$ and $\mathbb{E}(r_n r_n^\star) = \rho^2 I$. Further let $x_n$ be drawn from a symmetric decaying probability distribution $\nu$ on the unit sphere $S^{K-1}$ with $\mathbb{E}_x\|x\|_\infty^2 = \bar c_1^2$ and the maximal size of the noise be small compared to the size and decay of the coefficients $c_1, c_2$, meaning there exists $\beta < 1/2$, such that
\[ \nu\left( \frac{c_2(x) + \mu\|c(x)\|_1 + \rho_{\max}}{c_1(x) - c_2(x)} \leq \beta \right) = 1. \]
Abbreviate $\lambda := \bar c_1^2 - \frac{1 - \bar c_1^2}{K-1}$ and $C_L := (\sqrt{A} + \rho_{\max})^2$. If for some $0 < q < 1/4$ the number of samples $N$ satisfies

N - q + N -2y R < ^~ 2 f^ (30) 

then except with probability
\[ \exp\left( -\frac{N^{1-4q}\lambda^2}{4K^2 C_L} + Kd \log(NKC_L/\lambda) \right) \]
there is a local maximum of (28) resp. local minimum of (1) with $S = 1$ within distance at most $2N^{-q}$ to $\Phi$, ie. for the local maximum $\tilde\Psi$ we have $\max_k \|\tilde\psi_k - \phi_k\|_2 \leq 2N^{-q}$.

Proof: Conceptually we need to show that for some $\varepsilon_{\min}(N) < \varepsilon_{\max}(N)$ and with probability $p(N)$ for all perturbations $\Psi$ with $\varepsilon_{\min}(N) \leq \max_k \|\phi_k - \psi_k\|_2 \leq \varepsilon_{\max}(N)$ we have
\[ \frac{1}{N} \sum_{n=1}^N \|\Phi^\star y_n\|_\infty^2 > \frac{1}{N} \sum_{n=1}^N \|\Psi^\star y_n\|_\infty^2. \tag{31} \]
To do this we need to add three ingredients to the asymptotic results of Theorem 2.3: 1) that with high probability for a fixed perturbation the sum of signal responses concentrates around its expectation, 2) a dense enough net for the space of all perturbations and 3) that the mapping $\Psi \to \|\Psi^\star y\|_\infty^2$ is Lipschitz. Then we can argue that an arbitrary perturbation will be close to a perturbation in the net, for which the sum concentrates around its expectation. This expectation is in turn smaller than the expectation of the generating dictionary, around which the sum for the generating dictionary concentrates. We start by showing that $\Psi \to \|\Psi^\star y\|_\infty^2$ is Lipschitz on the set of all perturbations $\Psi$ with $\max_k \|\psi_k - \phi_k\|_2 \leq 1/2$. For simplicity we will write from now on $d(\bar\Psi, \Psi) := \max_k \|\bar\psi_k - \psi_k\|_2$. We have,
\begin{align*}
\big| \|\bar\Psi^\star y\|_\infty^2 - \|\Psi^\star y\|_\infty^2 \big|
&= \Big| \max_k |\langle \psi_k + (\bar\psi_k - \psi_k), y\rangle|^2 - \max_k |\langle \psi_k, y\rangle|^2 \Big| \\
&\leq \Big| \max_k \big( |\langle \psi_k, y\rangle|^2 + 2|\langle \bar\psi_k - \psi_k, y\rangle| |\langle \psi_k, y\rangle| + |\langle \bar\psi_k - \psi_k, y\rangle|^2 \big) - \max_k |\langle \psi_k, y\rangle|^2 \Big| \\
&\leq 2\|y\|_2^2 \max_k \|\bar\psi_k - \psi_k\|_2 + \|y\|_2^2 \max_k \|\bar\psi_k - \psi_k\|_2^2 \\
&\leq 3\|y\|_2^2 \cdot d(\bar\Psi, \Psi).
\end{align*}
Since the signals $y_n = \Phi x_n + r_n$ are generated from a tight frame with unit norm coefficients and a bounded white noise vector, we have $\|y_n\|_2 \leq \sqrt{A} + \rho_{\max}$, so $\Psi \to \frac{1}{N} \sum_{n=1}^N \|\Psi^\star y_n\|_\infty^2$ is Lipschitz with constant $3(\sqrt{A} + \rho_{\max})^2$.
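The Lipschitz estimate is easy to probe numerically. The following sketch draws random pairs of unit norm dictionaries and a random signal and records the largest violation of the bound with constant $3\|y\|_2^2$; the dimensions, the perturbation size and the function names are illustrative assumptions.

```python
import numpy as np

def normalise(m):
    """Rescale every column (atom) to unit Euclidean norm."""
    return m / np.linalg.norm(m, axis=0, keepdims=True)

def response(psi, y):
    """Squared maximal inner product ||Psi^* y||_inf^2."""
    return np.max(np.abs(psi.T @ y)) ** 2

rng = np.random.default_rng(3)
d, K = 8, 16
worst_gap = -np.inf

for _ in range(1000):
    psi = normalise(rng.standard_normal((d, K)))
    psi_bar = normalise(psi + 0.1 * rng.standard_normal((d, K)))
    y = rng.standard_normal(d)
    dist = np.max(np.linalg.norm(psi_bar - psi, axis=0))   # d(Psi_bar, Psi)
    lhs = abs(response(psi_bar, y) - response(psi, y))
    rhs = 3.0 * np.dot(y, y) * dist
    worst_gap = max(worst_gap, lhs - rhs)

print("largest value of lhs - rhs over all trials:", worst_gap)
```

A non-positive value of `worst_gap` is what the estimate predicts. For unit norm atoms the bound also holds trivially once the distance exceeds 1/3, since both responses lie in $[0, \|y\|_2^2]$.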

Next we use Hoeffding's inequality to estimate the probability that for a fixed dictionary $\bar\Psi$, the sum of responses $\frac{1}{N} \sum_{n=1}^N \|\bar\Psi^\star y_n\|_\infty^2$ deviates from its expectation. Set $Y_n = \|\bar\Psi^\star y_n\|_\infty^2$, then we have $Y_n \in [0, (\sqrt{A} + \rho_{\max})^2]$ and get the estimate,

-Nt 2 



P 



\ n=l 



2 

oo 



> t < exp 



A 



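The last ingredient is a $\delta$-net for all perturbations $\Psi$ with $d(\Psi, \Phi) \leq \varepsilon_{\max}$, ie. a finite set of perturbations $\mathcal{N}$ such that for every $\Psi$ we can find $\bar\Psi \in \mathcal{N}$ with $d(\Psi, \bar\Psi) < \delta$. Remembering the parametrisation of all $\varepsilon$-perturbations from the proof of Theorem 2.1 we see that the space we need to cover is the product of $K$ balls with radius $\varepsilon_{\max}$ in $\mathbb{R}^{d-1}$. Following e.g. the argument in Lemma 2 of [32] we know that for the $m$-dimensional ball of radius $\varepsilon_{\max}$ we can find a $\delta$-net $\mathcal{N}_m$ with
\[ \sharp \mathcal{N}_m \leq \left( 1 + \frac{2\varepsilon_{\max}}{\delta} \right)^m. \]
Thus for the product of $K$ balls in $\mathbb{R}^{d-1}$ we can construct a $\delta$-net $\mathcal{N}$ as the product of $K$ $\delta$-nets $\mathcal{N}_{d-1}$. Assuming that $\delta < 1$ we then have,
\[ \sharp \mathcal{N} \leq \left( 1 + \frac{2\varepsilon_{\max}}{\delta} \right)^{K(d-1)} \leq \left( \frac{3\varepsilon_{\max}}{\delta} \right)^{K(d-1)}. \]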

Using a union bound we can now estimate the probability that for all perturbations in the net 
the sum of responses concentrates around its expectation, as 



P 3f ejV 



i N 



2 

nlloo 



E(||** yi | 



n=l 



> t < 



. A^' / -Nt 2 
exp 
Finally we are ready for the triangle inequality argument. For any $\Psi$ with $d(\Psi, \Phi) = \varepsilon \leq \varepsilon_{\max}$ we can find $\bar\Psi \in \mathcal{N}$ with $d(\bar\Psi, \Psi) < \delta$ and assuming wlog that $\Phi \in \mathcal{N}$ we have that

1 * 1 * 



n=l n=l 



N 

'Mill, - m^vwio + n&vwio - n**v\\io 



i N 

-Y\\& 



n=l 

N „ jV , JV 



+ eii^iil - 1 E PVIIL + ~ E ii*^h 2 oo - ^ E H**a»ll« 

n=l n=l n=l 

>E||$*y||^-E||**y||^-2t-3<5Ci 
Next we identify e max up to 5 by showing that for <£) = e < e max we can lower bound the 



sum in the last equation by j^e 2 /2. Following the argument in Appendix A.2 with the necessary 
changes we see that for e < 1/5 and 

(1 - 2/3) 2 
- 2Alog(4AK"/A) 

we have 

Thus as soon as e < 2 A\tg{iAK/\) ~ 5 := £max we have 

A E H^IIL - ~ E H^IIL > ~ - a - 3^ 

n=l n=l 

> A<£^-2t-3*C L > A^_ 2t -45C L . 

A Z K Z 

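If for $q < 1/4$ we choose $t = N^{-2q}\lambda/(2K)$ and $\delta = N^{-2q}\lambda/(4KC_L)$ then except with probability
\[ \exp\left( -\frac{N^{1-4q}\lambda^2}{4K^2 C_L} + K(d-1) \log\big(12\varepsilon_{\max} C_L K N^{2q}/\lambda\big) \right) \]
we have
\[ \frac{1}{N} \sum_{n=1}^N \|\Phi^\star y_n\|_\infty^2 > \frac{1}{N} \sum_{n=1}^N \|\Psi^\star y_n\|_\infty^2 \]
whenever $\varepsilon \geq 2N^{-q} := \varepsilon_{\min}$. The statement then follows from the simplification that $\varepsilon_{\max} < 1/5$ together with $N^{1-4q} > 4K^2$ implies $12\varepsilon_{\max} N^{2q} < N$ and from verifying that $\varepsilon_{\min} < \varepsilon_{\max}$. □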
Remark 4.1: Note that the above theorem is not only a result for the K-SVD minimisation principle but actually for K-SVD. While for $S > 1$ the decay-condition is not strong enough to ensure that the sparse approximation algorithm used for K-SVD always finds the best approximation as soon as we are close enough to the generating dictionary, in the case $S = 1$ any simple greedy algorithm, e.g. thresholding, will always find the best 1-term approximation to any signal given any dictionary. Thus given the right initialisation and sufficiently many training samples K-SVD can recover the generating dictionary up to the prescribed precision with high probability. To make the theorem more applicable we quickly concretise how the distance between the generating dictionary $\Phi$ and the local minimum output by K-SVD, $\tilde\Psi$, decreases with the sample size. If we want the success probability to be of the order $1 - N^{-Kd}$ we need
\[ -\frac{N^{1-4q}\lambda^2}{4K^2 C_L} + Kd \log(NKC_L/\lambda) \approx -Kd \log N, \]
or $N^{1-4q} \approx K^3 d \log N$, meaning that $-q \approx -\frac{1}{4} + \frac{\log(K^3 d \log N)}{4 \log N}$. Thus we have
\[ \log\big(d(\Phi, \tilde\Psi)\big) = -q \log N \approx -\frac{\log N}{4} + \log K \quad \text{or} \quad d(\Phi, \tilde\Psi) \approx K N^{-1/4}. \tag{32} \]
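The observation that for $S = 1$ thresholding always finds the best 1-term approximation suggests a very short implementation of the K-SVD iteration in this case. The sketch below is an illustration under assumptions, not the reference implementation: the function names, the Hadamard-Dirac dictionary in dimension 8, the 2-sparse signal model, the sample size and the oracle initialisation are all choices made for the example. For $S = 1$ every signal uses exactly one atom, so the residual restricted to the signals assigned to an atom is just those signals, and the atom update reduces to their dominant left singular vector.

```python
import numpy as np

def hadamard_dirac(d):
    """Return the d x 2d dictionary [I, H/sqrt(d)], d a power of two (illustrative choice)."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.kron(h, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return np.hstack([np.eye(d), h / np.sqrt(d)])

def ksvd_one_sparse(Y, psi_init, n_iter=10):
    """K-SVD-style iteration for sparsity S = 1: thresholding followed by an SVD atom update."""
    psi = psi_init.copy()
    for _ in range(n_iter):
        # sparse coding by thresholding: each signal picks its maximal inner product
        labels = np.argmax(np.abs(psi.T @ Y), axis=0)
        for k in range(psi.shape[1]):
            Yk = Y[:, labels == k]
            if Yk.shape[1] == 0:
                continue                      # keep an atom that no signal selected
            u, _, _ = np.linalg.svd(Yk, full_matrices=False)
            atom = u[:, 0]
            if atom @ psi[:, k] < 0:          # resolve the sign ambiguity of the SVD
                atom = -atom
            psi[:, k] = atom
    return psi

rng = np.random.default_rng(4)
d, N = 8, 2048
phi = hadamard_dirac(d)
K = phi.shape[1]

# 2-sparse training signals with one dominant coefficient and random signs
Y = np.empty((d, N))
for n in range(N):
    i, j = rng.choice(K, size=2, replace=False)
    c1 = rng.uniform(0.99, 1.0)
    c2 = np.sqrt(1.0 - c1 ** 2)
    s1, s2 = rng.choice([-1.0, 1.0], size=2)
    Y[:, n] = s1 * c1 * phi[:, i] + s2 * c2 * phi[:, j]

psi = ksvd_one_sparse(Y, phi)                 # oracle initialisation
error = np.max(np.linalg.norm(phi - psi, axis=0))
print("d(Phi, Psi) =", error)
```

Rerunning with larger `N` gives a rough picture of how the distance to the generating dictionary shrinks with the sample size; a random initialisation would additionally require matching atoms up to permutation and sign before measuring the error.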

5 Finite sample size results for S > l 

Let us now turn to the analysis of the problem with $S > 1$, ie.
\[ \max_{\Psi} \frac{1}{N} \sum_{n=1}^N \max_{|I| \leq S} \|P_I(\Psi) y_n\|_2^2. \tag{33} \]
As for the asymptotic case we will be less concrete but more precise and instead of using the coherence will give the results in terms of the lower isometry constant of the generating dictionary, which is defined as the largest distance of the smallest eigenvalue $\lambda_{\min}$ of $\Phi_I^\star \Phi_I$ to 1, ie. $\delta_S := \max_{|I| \leq S} \big(1 - \lambda_{\min}(\Phi_I^\star \Phi_I)\big)$. For simplicity we again state only the noise-free version.

Theorem 5.1: Let $\Phi$ be a unit norm tight frame with frame constant $A = K/d$, coherence $\mu$ and lower isometry constant $\delta_S \leq \mu S$. Assume that the signals $y_n$ are generated as $y_n = \Phi x_n$, where $x_n$ is drawn from a symmetric decaying probability distribution $\nu$ on the unit sphere $S^{K-1}$, and that there exists $\kappa > 0$ such that for $c(x)$ a non-increasing rearrangement of the absolute values of $x$, ie. $c_1(x) \geq c_2(x) \ldots \geq c_K(x)$, and $I_p := p^{-1}(\{1, \ldots, S\})$ we have,
\[ \nu\left( \min_{p,\sigma} \Big( \|P_{I_p}(\Phi)\Phi c_{p,\sigma}(x)\|_2 - \max_{|I| \leq S,\, I \neq I_p} \|P_I(\Phi)\Phi c_{p,\sigma}(x)\|_2 \Big) \geq 2\kappa \right) = 1. \tag{34} \]
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and abbreviate As := ^ — and Cs ■= ( 1 — gTrjg \ ) ■ If for some < g < 1/4 the number 



Define jg as the expected energy of the 5 largest coefficients, i.e. 7! := E x .(cf(x) + . . . c|(x)) 
of samples iV satisfies 

TV-? + iV- 2 7K < - , (35) 



68V51og(5^ 5 /A) 
then except with probability 

GXP V i^4~ + ™° g V ^S~ ) ) ' 

there is a local maximum of (33 1 resp. local minimum of (Jl) within distance at most 2N~ q to 
<]?, ie. for the local maximum ^ we have max/,, ||^ — </>fc||2 < 2N~ q . 

Proof: The proof follows the same strategy as in the simple case. However since we now have to deal with projections instead of simple inner products we have to suffer a bit more. Again we first show that the mapping $\Psi \to \max_{|I| \leq S} \|P_I(\Psi) y_n\|_2^2$ is Lipschitz on the set of perturbations with $d(\Psi, \Phi) \leq \varepsilon_{\max}$. We have,

max ||Pj(*)y„||l - max \\Pi^>)y n \\l 
\i\<o MIS'-' 

max ||P,(*)y n - (Pj(¥) - P/(*))i/ n ||l " max ||P/(*)y„||! 

|J|<£> l-<l<i> 
= 2 max ||(Pj(tf) - P/(*))i/ n || 2 max ||P/(*)y n || 2 + max || (Pj(tf) - Pj(*))j/„||1 

<3Amax||Pj(*)-Pi(*)|| a ,2. 
Following the line of argument in Appendix |B.2 we know that 



2S 



||PW - P/Wllls < ||PW - Pj(*)|fr < L "'"*- ,J " J 



1^112 2 11^1112 2-2^ 

7112,2 I II 7112,2 v ^^ji 



Now note that ||'I / / || 2 \ is simply the minimal singular value of ^7. Remembering that 26 implies 
5s < 1 we therefore have, 

Il*/ll2,2 = °"min(*i) = CT min ($/A 7 + Z/Wj) > 0- min ($ / )a min (^/) - CT max (Z/Wj) 

The combination of the last three estimates, together with some simplifications, using the fact 
that both e and d(*$>, are smaller than e max < leads us to the final Lipschitz bound, 



max ||Pj(*)!/n||2 ~ max ||P 7 (*)y n ||l 



<d(¥,S).-^L (36) 
VI - ds 
Next for $Y_n = \max_{|I| \leq S} \|P_I(\bar\Psi) y_n\|_2^2$ we have $Y_n \in [0, A]$ and therefore by Hoeffding's inequality,

N 



1 - 

AT E E*S 11^/(^)^111 - K(max 



> i < e 



-Nt 2 /A 



By a union bound we can estimate that the above holds for all, at most (3e max /5) x ^ _1 - ) , elements 
of a <5-net Af for the set of perturbations with d($>, <3?) < e mSLX . We can now turn to the triangle 
inequality argument. For a perturbation ^ with d($>, <£) = e < e max we can find ^ G M with 
c?^, < 5 and <i('l f , = e. Analogue to the case 5 = 1 we then have 

1 N 1 N 

N E g(§ H P ^ - ]v E ^§ II 

> E j^max WPiWynWi) ~ E (max ||Pj(*)y n ||^ - 2t - *^)== 



7 



-2.4 > exp — — — , r — 2i — 5 



/:Pf(*)^Pr(*) 



where we have used the continuous equivalent of the estimate in ( |24| |. From Appendix B.2 



we 



know that for e < e max < 1 % we have 

64 v o 

9Q fi7 
ll^xlH - [|P/(*)*j||| > ~||Qi(<&)£/||| and 32||P / ($) - Pj(f )||| < ] 4^||Q / ($)P 7 ||| ; 

so we can continue the estimate above as, 

N „ TV 



^ E ^ ll p K*)ynlli - ^ E ^ II W)i/«lll 



> £ (^(5) iw,m*iii - ^exp ( w|q ; WJ>il |, )) - 2* - i^. 

As in the case 5 = 1 we now identify e max up to <5 by checking when the expressions in 
the sum above are larger than ||Qj(*)S/|||,. Following again the line of argument in 

Appendix |A.2 we get that 

2.4 exp f ~f { l~- S l ) < — (a _1 ||0/(*)B/|||., 



as soon as 

.2 



o,(*)fi,il,< M1 "' 5) 



671og(5(f)A/A 
which is in turn implied by

e/Vl-e < —/ r- or e< =— ^ — := £ max + S. 

~ 67V51og (5(f) A/x) ~ 68VSlog(5AKS/X) 



Thus as soon as e < M ^ ( ^ fl/A) ~ S := s max we have, 



N - N 



jf E s w Pi ^y^ -^Ek n^w^ni ^ ^(ST 1 E ii<m*wif - ^ - 6 jr=k- 

n=l n=l I 

To estimate the size of the sum over all possible supports we remember that bi = ^Zi where 

V>i = oiifyi + ujiZi with Zi) = and that maxj — <^>^ 1 1 2 = £■ We have 

(5 ) _1 E n^w^ii^ = g r 1 E (ii^m - wpmsiwl) 
1 1 

= (f)" 1 (5--i 1 )ii^ii^-(f)" 1 Eii(^r^iii 
>|ii«-(^)"Eii^i| 2 ii^i^ 

>^\\m%-Q-\ K 8 z^-s 8 )-^Bf F 

S ( A S-l\ - 2 S ( S \ 2 

V 1_ l-5 s i<r- J ~ \ a!(l — 5 s) ) E ' 



where in the last inequality we have used that 

II WF - 1 _£ 2 + £4/ 4 - 

With this last simplification we finally arrive at an estimate, which suggests the correct sizes 
for t and S, ie. 

1 N 1 N 

N E ft" " *F E s ll^WVnlll 



N^\i\<s" v 7 " z iV^|/|<s' 

n=l n=l 





AS 




> 


(* 




! 






XS 




> 








2K ' 





_S \ _ 2 _ _ 

5 \ _ 2 



- 2t - 6- 



d{l-5 s )J VT^ds" 
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We now choose * = N~*g (l - ^) and 5 = jV^ ^gg (i _ to get/ that 

except with probability, 

exp ^ + " 11 log J J ' 

we have 

1 V max ||Pj(*)y„||i - - V max ||Pi(* bJi > — (l- , S (e 2 - AN^), 
N^\i\<s n n Jy 112 JV^|/|<s" V jyn|l2 -2KV d(l-S s )J K 

n=l n=l ' 

which is larger than zero as long as e > 2N~ q := e m - m . The statement again follows from 
simplifications using e max < and verifying that e min < e max . □ 

Note that in order to get a more explicit result the abstract condition in (34) can again be replaced by a concrete condition in terms of the coherence (27), and also the lower isometry constant can be estimated by $\delta_S \leq (S-1)\mu$.
Let us now turn to a discussion of our results.

6 Discussion 

We have shown that the minimisation principle underlying K-SVD can identify a tight frame 
with arbitrary precision from signals generated from a wide class of decaying coefficients 
distributions, provided that the training sample size is large enough. For the case S = 1 in 
particular this means that K-SVD in combination with a greedy algorithm can recover the 
generating dictionary up to prescribed precision. To illustrate our results we conducted two 
experiments. 

The first experiment demonstrates that the requirement on the dictionary to be tight in order to be identifiable translates to the case of finitely many training samples. For simplicity and to allow for a visual representation of the outcome it was conducted in $\mathbb{R}^2$. We generated 1000 coefficients by drawing $c_2$ uniformly at random from the interval $[0, 0.6]$, setting $c_1 = \sqrt{1 - c_2^2}$, randomly permuting the resulting vector and providing it with random $\pm$ signs. We then generated four sets of signals, using four bases with increasing coherence and the same coefficients, and for each set of signals found the minimiser of the K-SVD criterion (1) with $S = 1$. Figure 1 shows the objective function for the case of an orthonormal basis, while Figure 2 shows the four signal sets, the generating bases and the recovered bases. As predicted by our theoretical results, when the generating basis is orthogonal it is also the minimiser of the K-SVD criterion, while for an oblique generating basis the minimiser is distorted towards the maximal eigenvector of the basis. Since for a 2-dimensional basis in combination with our coefficient distribution the abstract condition in (26) is always fulfilled, this effect can only be due to the violation of the tightness-condition.
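The orthonormal case of this experiment can be reproduced in a few lines. The sketch below is a simplified stand-in for the procedure described above, with several assumptions: the minimiser of the K-SVD criterion with $S = 1$ is found by a plain grid search (in steps of 3 degrees) over the two angles parametrising the atoms, using the equivalent maximisation of the average squared maximal response; the random seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1000

# coefficients: c_2 uniform in [0, 0.6], c_1 = sqrt(1 - c_2^2), random order and signs
c2 = rng.uniform(0.0, 0.6, size=N)
c1 = np.sqrt(1.0 - c2 ** 2)
coeffs = np.stack([c1, c2])
swap = rng.random(N) < 0.5
coeffs[:, swap] = coeffs[::-1, swap]
coeffs *= rng.choice([-1.0, 1.0], size=(2, N))

phi = np.eye(2)                                   # orthonormal generating basis
Y = phi @ coeffs

# admissible atoms psi(theta) = (cos theta, sin theta) on a grid of angles in [0, 180)
angles = np.deg2rad(np.arange(0, 180, 3))
atoms = np.stack([np.cos(angles), np.sin(angles)])
resp = (atoms.T @ Y) ** 2                         # squared responses, one row per angle

# average squared maximal response for every pair of angles (theta_1, theta_2)
objective = np.maximum(resp[:, None, :], resp[None, :, :]).mean(axis=2)
i, j = np.unravel_index(np.argmax(objective), objective.shape)
best = np.rad2deg(angles[[i, j]])
print("maximising angles in degrees:", best)
```

Up to the sign and ordering ambiguity the maximising angles should lie close to 0 and 90 degrees, ie. at the generating basis; replacing `phi` by an oblique basis lets one observe the distortion described above.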



[Figure 1: surface plot titled "ksvd-criterion".]

Fig. 1. The K-SVD-criterion for the signals created from the decaying coefficients and an orthonormal basis, the admissible dictionaries are parametrised by two angles $(\theta_1, \theta_2)$, ie. $\phi_i = (\cos\theta_i, \sin\theta_i)$.



The second experiment illustrates how the local minimum near the generating dictionary approaches the generating dictionary as the number of signals increases. As generating dictionary we chose the union of two orthonormal bases, the Hadamard and the Dirac basis, in dimension $d = 4, 8, 16$, ie. $K = 2d$. We then generated 2-sparse signals by first drawing $c_1$ uniformly at random from the interval $[0.99, 1]$, setting $c_2 = \sqrt{1 - c_1^2}$, meaning $c_2 \in [0, 0.1]$, and $c_i = 0$ for $i \geq 3$, and then setting $y = \Phi c_{\sigma,p}$ for a uniformly at random chosen sign sequence $\sigma$ and permutation $p$. We then ran the original K-SVD algorithm as described in [1], with a greedy algorithm and sparsity parameter $S = 1$, using both an oracle initialisation (ie. the generating dictionary) and a random initialisation, on training sets containing $128 \cdot 2^n$ signals for $n$ increasing from 0 to 7. Figure 3 (a) plots the maximal distance between two corresponding atoms of the generating and the learned dictionary, $d(\Phi, \Psi) = \max_i \|\phi_i - \psi_i\|_2$, averaged over 10 runs. Figure 3 (b) is designed to be comparable to the experiment conducted for the noisy $\ell_1$-criterion in [18] and plots the normalised Frobenius norm between the generating and the learned dictionary, $\|\Phi - \Psi\|_F/\sqrt{dK^3}$, averaged over 10 runs.

[Figure 2: scatter plots with both axes ranging from $-1$ to $1$; surviving panel titles $\mu = 0$ and $\mu = 0.1045$.]

Fig. 2. Signals created from various bases $\Phi = (\phi_1, \phi_2)$ with increasing coherence $\mu$, together with the corresponding minimiser $\Psi = (\psi_1, \psi_2)$ of the K-SVD-criterion for $S = 1$.

As expected we have a log-linear relation between the number of samples and the reconstruction error. However our predictions seem to be too pessimistic. So rather than an inclination of $-\frac{1}{4}$ we see one of $-\frac{1}{2}$, indicating that $d(\Phi, \Psi) \approx N^{-1/2}$. We also see that both the oracle and the random initialisation lead to the same results, raising the question of uniqueness of the equivalent local minima, compare also [18].

[Figure 3: two panels, (a) with horizontal axis $\log_2$(number of signals) and (b) with horizontal axis $\log_{10}$(number of signals).]

Fig. 3. Error between the generating Hadamard-Dirac dictionary $\Phi$ in $\mathbb{R}^d$ and the output $\Psi$ of the K-SVD algorithm with parameter $S = 1$; the error is measured as $d(\Phi, \Psi) = \max_i \|\phi_i - \psi_i\|_2$ in (a) and as $\|\Phi - \Psi\|_F/\sqrt{dK^3}$ in (b).

Finally let us point out further research directions based on a comparison of our results for the K-SVD-minimisation principle to the available identification results for the $\ell_1$-minimisation principle,
\[ \min_{\Psi, X:\, Y = \Psi X} \sum_{ij} |X_{ij}|. \tag{37} \]
At first glance it seems that the K-SVD-criterion requires a larger sample size than the $\ell_1$-criterion, ie. $N^{1-4q}/\log N = O(K^3 d)$ as opposed to $O(d^2 \log d)$ reported in [17] for a basis and $O(K^3)$ reported in [14] for an overcomplete dictionary. Also it does not allow for exact identification with high probability but only guarantees stability. However this effect may be due to the more general signal model which assumes decay rather than exact sparsity. Indeed it is very interesting to compare our results to a recent result for a noisy version of the $\ell_1$-minimisation principle, [18], which provides stability results under unbounded white noise and, omitting log factors, also derives a sampling complexity of $O(K^3 d)$.

Another difference, apparently intrinsic to the two minimisation criteria, is that the K-SVD criterion can only identify tight dictionary frames exactly, while the $\ell_1$-criterion allows identification of arbitrary dictionaries. Thus to support the use of K-SVD for the learning of non-tight dictionaries also theoretically, we plan to study the stability of the K-SVD criterion under non-tightness by analysing the maximal distance between an original, non-tight dictionary with condition number $\sqrt{B/A} > 1$ and the closest local maximum, cp. also Figure 2.
The last research direction we want to point out is how much decay of the coefficients is actually necessary. For the one-dimensional asymptotic results we used condition $c_1 > c_2 + 2\mu\|c\|_1$ to ensure that the maximal inner product is always attained at $i_p$. However, typically we have $|\langle \phi_i, \Phi c_{p,\sigma}\rangle| \approx c_{p(i)} \pm \mu$. Therefore a condition such as $c_1 > c_2 + O(\mu)$, which allows for outliers, ie. signals for which the maximal inner product is not attained at $i_p$, might be sufficient to prove, if not exact identifiability, at least stability. Together with the inspiring techniques from [15], we expect the tools developed in the course of such an analysis to allow us also to deal with unbounded white noise.
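Appendix A

Technical details for the proof of Theorem 2.1

A.1 Expectations

We start by calculating $\mathbb{E}_p\mathbb{E}_\sigma\big(|\langle \psi_{i_p}, \Phi c_{p,\sigma}\rangle|^2\big)$ for two arbitrary unit norm frames $\Psi, \Phi$.
\begin{align}
\mathbb{E}_p\mathbb{E}_\sigma\big(|\langle \psi_{i_p}, \Phi c_{p,\sigma}\rangle|^2\big)
&= \mathbb{E}_p\mathbb{E}_\sigma\Big( \big| \sum_i \sigma_i c_{p(i)} \langle \psi_{i_p}, \phi_i\rangle \big|^2 \Big) \notag \\
&= \sum_i \mathbb{E}_p\big( c_{p(i)}^2 |\langle \psi_{i_p}, \phi_i\rangle|^2 \big). \tag{38}
\end{align}
For each $i$ we now split the set of all permutations $\mathcal{P}$ into disjoint sets $\mathcal{P}^i_{jk}$, defined as
\[ \mathcal{P}^i_{jk} := \{p : p(i) = k,\ p(j) = 1\}. \]
We then have $\mathcal{P} = \cup_{j,k} \mathcal{P}^i_{jk}$ and
\[ |\mathcal{P}^i_{jk}| = \begin{cases} (K-1)! & \text{if } j = i \text{ and } k = 1 \\ (K-2)! & \text{if } j \neq i \text{ and } k \neq 1 \\ 0 & \text{else.} \end{cases} \]
Using these sets we can compute the expectations in (38) as follows
\begin{align*}
\mathbb{E}_p\big( c_{p(i)}^2 |\langle \psi_{i_p}, \phi_i\rangle|^2 \big)
&= \frac{1}{K!} \sum_j \sum_k \sum_{p \in \mathcal{P}^i_{jk}} c_k^2 |\langle \psi_j, \phi_i\rangle|^2 \\
&= \frac{c_1^2}{K} |\langle \psi_i, \phi_i\rangle|^2 + \frac{(K-2)!}{K!} \sum_{j \neq i} \sum_{k \neq 1} c_k^2 |\langle \psi_j, \phi_i\rangle|^2 \\
&= \frac{c_1^2}{K} |\langle \psi_i, \phi_i\rangle|^2 + \frac{1 - c_1^2}{K(K-1)} \sum_{j \neq i} |\langle \psi_j, \phi_i\rangle|^2.
\end{align*}
Re-substituting the above expression into (38) finally leads to,
\[ \mathbb{E}_p\mathbb{E}_\sigma\big(|\langle \psi_{i_p}, \Phi c_{p,\sigma}\rangle|^2\big) = \frac{c_1^2}{K} \sum_k |\langle \psi_k, \phi_k\rangle|^2 + \frac{1 - c_1^2}{K(K-1)} \Big( \|\Phi^\star \Psi\|_F^2 - \sum_k |\langle \psi_k, \phi_k\rangle|^2 \Big). \]
We can simplify the above result for three important special cases. If $\Phi$ is a unit norm tight frame, we have $\|\Phi^\star \Psi\|_F^2 = AK$ and therefore
\[ \mathbb{E}_p\mathbb{E}_\sigma\big(|\langle \psi_{i_p}, \Phi c_{p,\sigma}\rangle|^2\big) = \frac{c_1^2}{K} \sum_k |\langle \psi_k, \phi_k\rangle|^2 + \frac{1 - c_1^2}{K(K-1)} \Big( AK - \sum_k |\langle \psi_k, \phi_k\rangle|^2 \Big), \]
if $\Psi = \Phi$, we have,
\[ \mathbb{E}_p\mathbb{E}_\sigma\big(|\langle \phi_{i_p}, \Phi c_{p,\sigma}\rangle|^2\big) = c_1^2 + \frac{1 - c_1^2}{K(K-1)} \big( \|\Phi^\star \Phi\|_F^2 - K \big), \]
and if $\Phi = \Psi$ is a unit norm tight frame, we have,
\[ \mathbb{E}_p\mathbb{E}_\sigma\big(|\langle \phi_{i_p}, \Phi c_{p,\sigma}\rangle|^2\big) = c_1^2 + \frac{1 - c_1^2}{K-1} (A - 1). \]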



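A.2 $\varepsilon$-Condition

To complete the proof of Theorem 2.1 we still need to verify that $\varepsilon < 1/5$ and
\[ \varepsilon \leq \frac{(1-2\beta)^2}{2A\log(2AK/\lambda)} \tag{39} \]
imply
\[ 2AK \exp\left( -\frac{(1 - \varepsilon_i^2/2 - 2\beta)^2}{2A\varepsilon_i^2} \right) - \lambda\big(\varepsilon_i^2 - \varepsilon_i^4/4\big) < 0, \]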

for all < Si < e, where we have used the shorthand (3 = ^jf^ 1 and A = cf — . Next ((39 
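implies $\varepsilon^2/2 \leq (1-2\beta)^4/(8\log^2 4) < (1-2\beta)/15$, so we can estimate,
\begin{align*}
\exp\left( -\frac{(1 - \varepsilon_i^2/2 - 2\beta)^2}{2A\varepsilon_i^2} \right)
&\leq \exp\left( -\frac{14^2 (1-2\beta)^2}{15^2 \cdot 2A\varepsilon_i^2} \right) \\
&\leq \exp\left( -\frac{14^2 (1-2\beta)^2 \cdot 2A\log(2AK/\lambda)}{15^2 \cdot 2A\varepsilon_i \cdot (1-2\beta)^2} \right) \\
&= \exp\big( -\log(2AK/\lambda) \cdot 14^2/15^2 \cdot 1/\varepsilon_i \big).
\end{align*}
For two values $a > 0$, $b > 1$ we have $ab \geq a + b$ as long as $a \geq b/(b-1)$. Setting $a = \log(2AK/\lambda)$ and $b = 14^2/15^2 \cdot 1/\varepsilon_i$ we see that this condition is satisfied for $\varepsilon_i \leq \varepsilon < 1/5$, so we can further estimate,
\begin{align*}
\exp\left( -\frac{(1 - \varepsilon_i^2/2 - 2\beta)^2}{2A\varepsilon_i^2} \right)
&\leq \exp\big( -( \log(2AK/\lambda) + 14^2/15^2 \cdot 1/\varepsilon_i ) \big) \\
&= \lambda/(2AK) \cdot \exp\big( -14^2/15^2 \cdot 1/\varepsilon_i \big).
\end{align*}
As last step we will show that for $0 < \varepsilon < 1/5$ we have $\exp(-14^2/15^2 \cdot 1/\varepsilon) < \varepsilon^2 - \varepsilon^4/4$ or equivalently that $\exp(14^2/15^2 \cdot 1/\varepsilon) > (\varepsilon^2 - \varepsilon^4/4)^{-1}$. Using a geometric series expansion we can estimate,
\[ \frac{1}{\varepsilon^2 - \varepsilon^4/4} = \frac{1}{\varepsilon^2} \cdot \frac{1}{1 - \varepsilon^2/4} = \frac{1}{\varepsilon^2} \sum_{i=0}^\infty \left(\frac{\varepsilon^2}{4}\right)^i = \frac{1}{\varepsilon^2} + \frac{1}{4} + \frac{\varepsilon^2}{16} \sum_{i=0}^\infty \left(\frac{\varepsilon^2}{4}\right)^i < \frac{1}{\varepsilon^2} + \frac{25}{99}. \]
At the same time we can lower bound $e^{a/\varepsilon}$, where $a = \left(\frac{14}{15}\right)^2$, as
\begin{align*}
e^{a/\varepsilon} = \sum_{i=0}^\infty \frac{1}{i!} \left(\frac{a}{\varepsilon}\right)^i
&> 1 + \frac{a}{\varepsilon} + \frac{a^2}{2\varepsilon^2} + \frac{a^3}{6\varepsilon^3} + \frac{a^4}{24\varepsilon^4} \\
&> 1 + \frac{1}{\varepsilon^2} \left( \frac{a^2}{2} + \frac{5a^3}{6} + \frac{25a^4}{24} \right) > 1 + \frac{1}{\varepsilon^2},
\end{align*}
leading to the desired inequality.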
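The last inequality of this section, $\exp(-14^2/15^2 \cdot 1/\varepsilon) < \varepsilon^2 - \varepsilon^4/4$ for $0 < \varepsilon < 1/5$, can also be confirmed numerically on a grid. The sketch below is only a sanity check under the assumption of a finite grid, not a replacement for the series argument.

```python
import numpy as np

a = (14.0 / 15.0) ** 2
eps = np.linspace(1e-3, 0.2, 2000)           # grid on (0, 1/5], an arbitrary resolution

lhs = np.exp(-a / eps)                       # exp(-14^2/15^2 * 1/eps)
rhs = eps ** 2 - eps ** 4 / 4.0
margin = rhs - lhs

print("smallest margin on the grid:", margin.min())
print("margin at eps = 1/5:", margin[-1])
```

On the whole grid the margin stays positive, so the exponential term is dominated by $\varepsilon^2 - \varepsilon^4/4$ throughout the checked range.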
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Appendix B 

Technical details for the proof of Theorem [371] 
B.1 Expectations 

We calculate E p E CT (||-f 5 / p ( , I') < l ) Cp i(J ||2) for two arbitrary unit norm frames whose spark is 
larger than S, ie. any subset of S vectors is linearly independent. 

E p E a (||P/ p (*)*^||i) = J> p (cJ w ||P/ p (#)&||!) (40) 

i 

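For each $i$ we now split the set of all permutations $\mathcal{P}$ into disjoint sets $\mathcal{P}^i_{J,k}$, defined as

$$\mathcal{P}^i_{J,k} := \{p : p(J) = \{1,\ldots,S\},\ p(i) = k\},$$

where $J$ is a subset of $\{1,\ldots,K\}$ with $|J| = S$ and $k = 1\ldots K$. We then have $\mathcal{P} = \bigcup_{J,k}\mathcal{P}^i_{J,k}$ and

$$|\mathcal{P}^i_{J,k}| = \begin{cases} (K-S-1)!\,S! & \text{if } i \notin J \text{ and } k \ge S+1 \\ (K-S)!\,(S-1)! & \text{if } i \in J \text{ and } k \le S \\ 0 & \text{else.} \end{cases}$$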
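The cardinalities of the sets of permutations with $p(J) = \{1,\ldots,S\}$ and $p(i) = k$ can be confirmed by brute force for small $K$. The sketch below is an illustration only; the function names `count_perms` and `predicted` and the choice $K = 5$, $S = 2$ are assumptions made here.

```python
import itertools
from math import factorial

def count_perms(K, S, i, J, k):
    """Count permutations p of {1..K} with p(J) = {1..S} and p(i) = k."""
    target = set(range(1, S + 1))
    n = 0
    for perm in itertools.permutations(range(1, K + 1)):
        p = dict(zip(range(1, K + 1), perm))  # p maps index -> image
        if {p[j] for j in J} == target and p[i] == k:
            n += 1
    return n

def predicted(K, S, i, J, k):
    """Closed-form cardinality from the case distinction in the text."""
    if i not in J and k >= S + 1:
        return factorial(K - S - 1) * factorial(S)
    if i in J and k <= S:
        return factorial(K - S) * factorial(S - 1)
    return 0

# Compare brute force and closed form for every (i, J, k) with K = 5, S = 2.
K, S = 5, 2
mismatches = [
    (i, J, k)
    for i in range(1, K + 1)
    for J in itertools.combinations(range(1, K + 1), S)
    for k in range(1, K + 1)
    if count_perms(K, S, i, J, k) != predicted(K, S, i, J, k)
]
```

An empty `mismatches` list means the three cases agree with direct enumeration for this small instance.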

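Using these sets we can compute the expectations in (40) as follows

$$E_p\left(c_{p(i)}^2\,\|P_{I_p}(\Psi)\phi_i\|_2^2\right) = \frac{1}{K!}\sum_J\sum_k\sum_{p\in\mathcal{P}^i_{J,k}} c_k^2\,\|P_J(\Psi)\phi_i\|_2^2$$

$$= \binom{K}{S}^{-1}\frac{1}{K-S}\sum_{J: i\notin J}\ \sum_{k>S} c_k^2\,\|P_J(\Psi)\phi_i\|_2^2 + \binom{K}{S}^{-1}\frac{1}{S}\sum_{J: i\in J}\ \sum_{k\le S} c_k^2\,\|P_J(\Psi)\phi_i\|_2^2.$$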

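Abbreviating $\gamma^2 := c_1^2 + \ldots + c_S^2$ and re-substituting the above expression into (40) leads to,

$$\binom{K}{S}E_pE_\sigma\left(\|P_{I_p}(\Psi)\Phi c_{p,\sigma}\|_2^2\right) = \frac{1-\gamma^2}{K-S}\sum_i\sum_{J: i\notin J}\|P_J(\Psi)\phi_i\|_2^2 + \frac{\gamma^2}{S}\sum_i\sum_{J: i\in J}\|P_J(\Psi)\phi_i\|_2^2$$

$$= \frac{1-\gamma^2}{K-S}\sum_J\sum_i\|P_J(\Psi)\phi_i\|_2^2 + \left(\frac{\gamma^2}{S} - \frac{1-\gamma^2}{K-S}\right)\sum_J\sum_{i\in J}\|P_J(\Psi)\phi_i\|_2^2$$

$$= \frac{1-\gamma^2}{K-S}\sum_J\|P_J(\Psi)\Phi\|_F^2 + \left(\frac{\gamma^2}{S} - \frac{1-\gamma^2}{K-S}\right)\sum_J\|P_J(\Psi)\Phi_J\|_F^2.$$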
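The identity $\binom{K}{S}E_pE_\sigma(\|P_{I_p}(\Psi)\Phi c_{p,\sigma}\|_2^2) = \frac{1-\gamma^2}{K-S}\sum_J\|P_J(\Psi)\Phi\|_F^2 + (\frac{\gamma^2}{S} - \frac{1-\gamma^2}{K-S})\sum_J\|P_J(\Psi)\Phi_J\|_F^2$ can be checked by exhaustive enumeration for a small example. The sketch below makes the following assumptions, which are choices made here and not taken from the text: indices are 0-based, $c_{p,\sigma}(i) = \sigma_i c_{p(i)}$, $I_p = p^{-1}(\{1,\ldots,S\})$, $\|c\|_2 = 1$, and $\Phi$, $\Psi$ are random unit norm frames with $d = 3$, $K = 4$, $S = 2$.

```python
import itertools
from math import comb
import numpy as np

def proj(M):
    """Orthogonal projection onto the column span of M (full column rank assumed)."""
    return M @ np.linalg.pinv(M)

def lhs_bruteforce(Phi, Psi, c, S):
    """Average of ||P_{I_p}(Psi) Phi c_{p,sigma}||^2 over all permutations and signs."""
    K = Phi.shape[1]
    vals = []
    for p in itertools.permutations(range(K)):
        I = [i for i in range(K) if p[i] < S]       # positions of the S largest entries
        P = proj(Psi[:, I])
        for s in itertools.product((1.0, -1.0), repeat=K):
            x = np.array([s[i] * c[p[i]] for i in range(K)])
            vals.append(np.sum((P @ (Phi @ x)) ** 2))
    return float(np.mean(vals))

def rhs_closed_form(Phi, Psi, c, S):
    """Right-hand side of the identity, divided by binom(K, S)."""
    K = Phi.shape[1]
    gamma2 = float(np.sum(c[:S] ** 2))
    w_out = (1 - gamma2) / (K - S)
    w_in = gamma2 / S
    full = part = 0.0
    for J in itertools.combinations(range(K), S):
        P = proj(Psi[:, list(J)])
        full += np.sum((P @ Phi) ** 2)              # ||P_J(Psi) Phi||_F^2
        part += np.sum((P @ Phi[:, list(J)]) ** 2)  # ||P_J(Psi) Phi_J||_F^2
    return float((w_out * full + (w_in - w_out) * part) / comb(K, S))

rng = np.random.default_rng(0)
d, K, S = 3, 4, 2
Phi = rng.standard_normal((d, K)); Phi /= np.linalg.norm(Phi, axis=0)
Psi = rng.standard_normal((d, K)); Psi /= np.linalg.norm(Psi, axis=0)
c = np.array([0.8, 0.5, 0.3, 0.1]); c /= np.linalg.norm(c)  # decreasing, unit norm

lhs = lhs_bruteforce(Phi, Psi, c, S)
rhs = rhs_closed_form(Phi, Psi, c, S)
```

Enumeration scales as $K!\,2^K$, so this is only feasible for very small $K$; it is meant as a consistency check of the bookkeeping, not as a computational method.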
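Since $\Phi$ is a tight frame we have $\|P_J(\Psi)\Phi\|_F^2 = \mathrm{tr}(\Phi^\star P_J(\Psi)^\star P_J(\Psi)\Phi) = \mathrm{tr}(P_J(\Psi)\Phi\Phi^\star) = AS$ and so we finally get

$$E_pE_\sigma\left(\|P_{I_p}(\Psi)\Phi c_{p,\sigma}\|_2^2\right) = \frac{AS(1-\gamma^2)}{K-S} + \left(\frac{\gamma^2}{S} - \frac{1-\gamma^2}{K-S}\right)\binom{K}{S}^{-1}\sum_J\|P_J(\Psi)\Phi_J\|_F^2,$$

which for $\Psi = \Phi$ reduces to

$$\gamma^2 + \frac{(A-1)(1-\gamma^2)S}{K-S}.$$

B.2 Projection $P_J(\Psi)$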



We want to compute the projection $P_J(\Psi) = \Psi_J(\Psi_J^\star\Psi_J)^{-1}\Psi_J^\star$ or more precisely $\|P_J(\Psi)\Phi_J\|_F^2$ and $\|P_J(\Phi) - P_J(\Psi)\|_F^2$ for $\Psi_J = \Phi_JA_J + Z_JW_J$ in terms of $\Phi_J$ and $Z_J$ up to order $O(\varepsilon^3)$. Note that Condition (21) implies that any subset of $S$ atoms of $\Phi$ is linearly independent. This means that $\Phi_J^\star\Phi_J$ is invertible and we can write $\Phi_J^\dagger = (\Phi_J^\star\Phi_J)^{-1}\Phi_J^\star$. (Ab)using the language of compressed sensing we denote the minimal eigenvalue of $\Phi_J^\star\Phi_J$ by $1 - \delta_J(\Phi)$ and define $\delta_S(\Phi) := \max_{|J|\le S}\delta_J(\Phi) < 1$, which is known as lower isometry constant. In the following we will usually omit the reference to the dictionary for simplicity. We first split $\Psi_J$ into the part contained in the span of $\Phi_J$ and the rest. Abbreviating $Q_J(\Phi) = I_d - P_J(\Phi)$ and $B_J = Z_JW_JA_J^{-1}$, we have

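$$\Psi_J = P_J(\Phi)\Psi_J + Q_J(\Phi)\Psi_J$$

$$= \Phi_JA_J + P_J(\Phi)Z_JW_J + Q_J(\Phi)Z_JW_J$$

$$= \left(\Phi_J(I_S + \Phi_J^\dagger B_J) + Q_J(\Phi)B_J\right)A_J. \qquad (41)$$

Next we calculate $(\Psi_J^\star\Psi_J)^{-1}$. Using the expression in (41) we have

$$\Psi_J^\star\Psi_J = A_J\left((I_S + \Phi_J^\dagger B_J)^\star\,\Phi_J^\star\Phi_J\,(I_S + \Phi_J^\dagger B_J) + B_J^\star Q_J(\Phi)B_J\right)A_J.$$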



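Since $\|\Phi_J^\dagger B_J\|_{2,2} \le \|\Phi_J^\dagger\|_{2,2}\|B_J\|_F \le (1-\delta_J)^{-1/2}\|B_J\|_F < 1$ we can calculate the inverse of $(I_S + \Phi_J^\dagger B_J)$ using a Neumann series, ie.

$$(I_S + \Phi_J^\dagger B_J)^{-1} = I_S + \sum_{i=1}^{\infty}(-\Phi_J^\dagger B_J)^i,$$

with $\|(I_S + \Phi_J^\dagger B_J)^{-1}\|_{2,2} \le (1 - \|\Phi_J^\dagger B_J\|_{2,2})^{-1}$. This allows us to rewrite $\Psi_J^\star\Psi_J$ as,

$$\Psi_J^\star\Psi_J = A_J(I_S + \Phi_J^\dagger B_J)^\star\,\Phi_J^\star\Phi_J\,(I_S + R_J)(I_S + \Phi_J^\dagger B_J)A_J,$$

for $R_J = (\Phi_J^\star\Phi_J)^{-1}(I_S + \Phi_J^\dagger B_J)^{\star-1}B_J^\star Q_J(\Phi)B_J(I_S + \Phi_J^\dagger B_J)^{-1}$.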
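Using the identity $\|(\Phi_J^\star\Phi_J)^{-1}\|_{2,2} = \|\Phi_J^\dagger\|_{2,2}^2$ we can estimate

$$\|R_J\|_{2,2} \le \|(\Phi_J^\star\Phi_J)^{-1}\|_{2,2}\,\|(I_S + \Phi_J^\dagger B_J)^{-1}\|_{2,2}^2\,\|Q_J(\Phi)B_J\|_{2,2}^2 \le \frac{\|Q_J(\Phi)B_J\|_F^2}{\left(\|\Phi_J^\dagger\|_{2,2}^{-1} - \|B_J\|_F\right)^2}.$$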



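For $\varepsilon$ small enough this is smaller than 1 and so we can again use a Neumann series to calculate the inverse,

$$(\Psi_J^\star\Psi_J)^{-1} = A_J^{-1}(I_S + \Phi_J^\dagger B_J)^{-1}\left(I_S + \sum_{i=1}^{\infty}(-R_J)^i\right)(\Phi_J^\star\Phi_J)^{-1}(I_S + \Phi_J^\dagger B_J)^{\star-1}A_J^{-1}.$$

Thus we finally get for the projection on the perturbed atoms indexed by $J$,

$$P_J(\Psi) = \left(\Phi_J + Q_J(\Phi)B_J(I_S + \Phi_J^\dagger B_J)^{-1}\right)\left(I_S + \sum_{i=1}^{\infty}(-R_J)^i\right)(\Phi_J^\star\Phi_J)^{-1}\left(\Phi_J + Q_J(\Phi)B_J(I_S + \Phi_J^\dagger B_J)^{-1}\right)^\star.$$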
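The factorised expression for the projection onto the perturbed atoms can be compared against a direct computation. The sketch below is a numerical illustration under assumptions made here: real matrices, $d = 8$, $S = 3$, perturbation directions $z_i$ of unit norm orthogonal to $\phi_i$, and weights $\alpha_i = 1 - \varepsilon_i^2/2$, $\omega_i = (\varepsilon_i^2 - \varepsilon_i^4/4)^{1/2}$ on the diagonals of $A_J$ and $W_J$. The Neumann sum $I_S + \sum_{i\ge 1}(-R_J)^i$ is evaluated in closed form as $(I_S + R_J)^{-1}$, which agrees with the series whenever $\|R_J\|_{2,2} < 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, S = 8, 3

def unit_cols(M):
    """Normalise every column to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=0)

# Generating atoms Phi_J and perturbation directions Z_J with z_i orthogonal to phi_i.
Phi_J = unit_cols(rng.standard_normal((d, S)))
Z = rng.standard_normal((d, S))
Z -= Phi_J * np.sum(Phi_J * Z, axis=0)
Z_J = unit_cols(Z)

# Assumed parametrisation of the perturbed atoms: psi_i = alpha_i phi_i + omega_i z_i.
eps = np.array([0.05, 0.10, 0.08])
A_J = np.diag(1 - eps ** 2 / 2)
W_J = np.diag(np.sqrt(eps ** 2 - eps ** 4 / 4))
Psi_J = Phi_J @ A_J + Z_J @ W_J

# Building blocks of the factorisation.
pinv_Phi = np.linalg.pinv(Phi_J)                     # Phi_J^dagger
P_Phi = Phi_J @ pinv_Phi                             # P_J(Phi)
Q = np.eye(d) - P_Phi                                # Q_J(Phi)
B = Z_J @ W_J @ np.linalg.inv(A_J)                   # B_J
M_inv = np.linalg.inv(np.eye(S) + pinv_Phi @ B)      # (I_S + Phi^dagger B)^{-1}
gram_inv = np.linalg.inv(Phi_J.T @ Phi_J)
R = gram_inv @ M_inv.T @ B.T @ Q @ B @ M_inv         # R_J
neumann = np.linalg.inv(np.eye(S) + R)               # I_S + sum_{i>=1} (-R_J)^i

G = Phi_J + Q @ B @ M_inv
P_formula = G @ neumann @ gram_inv @ G.T             # factorised P_J(Psi)
P_direct = Psi_J @ np.linalg.pinv(Psi_J)             # direct P_J(Psi)

err_projection = np.linalg.norm(P_formula - P_direct)
dist_sq = np.linalg.norm(P_Phi - P_direct, "fro") ** 2
dist_sq_trace = -2 * np.trace(neumann - np.eye(S))   # -2 tr(sum_{i>=1} (-R_J)^i)
```

Besides the projection itself, the sketch compares $\|P_J(\Phi) - P_J(\Psi)\|_F^2$ with $-2\,\mathrm{tr}\sum_{i\ge 1}(-R_J)^i$, which follows from the same factorisation because $\Phi_J^\star Q_J(\Phi) = 0$.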

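To calculate $\|P_J(\Phi) - P_J(\Psi)\|_F^2$ up to order $O(\varepsilon^3)$ we need to keep track of all terms involving $B_J$ up to second order. We have,

$$\|P_J(\Phi) - P_J(\Psi)\|_F^2 = \mathrm{tr}(P_J(\Phi)) - 2\,\mathrm{tr}(P_J(\Phi)P_J(\Psi)) + \mathrm{tr}(P_J(\Psi))$$

$$= 2S - 2\,\mathrm{tr}\left((\Phi_J^\star\Phi_J)^{-1}\Phi_J^\star\Psi_J(\Psi_J^\star\Psi_J)^{-1}\Psi_J^\star\Phi_J\right)$$

$$= 2S - 2\,\mathrm{tr}\left(I_S + \sum_{i=1}^{\infty}(-R_J)^i\right) = -2\,\mathrm{tr}\left(\sum_{i=1}^{\infty}(-R_J)^i\right)$$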



< 2||Q J (<D)P J ||| 



(ii^ii 2 ^-iipj||f) -wqjWBjw*. 

< t 0(||Q,(*)P^). 
I|3jllj,2 (11^112,2 -2||Pj||f) 
Similarly we get for $\|P_J(\Psi)\Phi_J\|_F^2$,

$$\|P_J(\Psi)\Phi_J\|_F^2 = \mathrm{tr}\left(\Phi_J^\star\Psi_J(\Psi_J^\star\Psi_J)^{-1}\Psi_J^\star\Phi_J\right) = \mathrm{tr}\left(\Phi_J^\star\Phi_J\left(I_S + \sum_{i=1}^{\infty}(-R_J)^i\right)\right)$$

/ oo 



tr - tr f f I fl + B ~jQj(®) B J (is + W 

+ tr^$ J £(-i? J ) i j 
tr - tr ( J B}Q J (d>)B J ) - 2 tr ( l^Qj^Bj Bj)* ) 



i=l 



- tr t BJQ^JBj ^(-^jBjY + tr ^ ^(-^ 

\\i=l / i=l / V »=2 

which leads to the upper bound, 

00 00 
WPj^jWf < \\*Af - \\Qj{*)Bj\\% + 2||Q J (<I>) J B J ||^ £ II^HV + ||d>^ JjRj || F ^ 

PSll 2 , 2 -||^lb 11^11,-1(11^11^-2115^1^ 
$$= \|\Phi_J\|_F^2 - \|Q_J(\Phi)B_J\|_F^2 + O\left(\|Q_J(\Phi)B_J\|_F^2\,\|B_J\|_F\right).$$

Appendix C

Decay condition (25)

Here we sketch how to derive decay condition (25). For simplicity we write $I$ instead of $I_p$. For any subset of $S$ indices $J \neq I$ we have,

$$\|P_I(\Phi)\Phi x\|_2^2 = \|P_I(\Phi)\left(\Phi_{I\cap J}x_{I\cap J} + \Phi_{I/J}x_{I/J} + \Phi_{J/I}x_{J/I} + \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\right)\|_2^2,$$
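and therefore,

$$\|P_I(\Phi)\Phi x\|_2^2 - \|P_J(\Phi)\Phi x\|_2^2$$

$$= \|\Phi_{I/J}x_{I/J}\|_2^2 - \|\Phi_{J/I}x_{J/I}\|_2^2$$

$$+ \|P_I(\Phi)\Phi_{J/I}x_{J/I}\|_2^2 - \|P_J(\Phi)\Phi_{I/J}x_{I/J}\|_2^2$$

$$+ \|[P_I(\Phi) - P_{I\cap J}(\Phi)]\Phi_{(I\cup J)^c}x_{(I\cup J)^c}\|_2^2 - \|[P_J(\Phi) - P_{I\cap J}(\Phi)]\Phi_{(I\cup J)^c}x_{(I\cup J)^c}\|_2^2$$

$$+ 2\langle\Phi_{I/J}x_{I/J}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle - 2\langle\Phi_{J/I}x_{J/I}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle$$

$$+ 2\langle P_I(\Phi)\Phi_{J/I}x_{J/I}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle - 2\langle P_J(\Phi)\Phi_{I/J}x_{I/J}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle$$

$$\ge \|\Phi_{I/J}x_{I/J}\|_2^2 - \|\Phi_{J/I}x_{J/I}\|_2^2$$

$$- \|P_J(\Phi)\Phi_{I/J}x_{I/J}\|_2^2 - \|[P_J(\Phi) - P_{I\cap J}(\Phi)]\Phi_{(I\cup J)^c}x_{(I\cup J)^c}\|_2^2$$

$$- 2|\langle\Phi_{I/J}x_{I/J}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle| - 2|\langle\Phi_{J/I}x_{J/I}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle|$$

$$- 2|\langle P_I(\Phi)\Phi_{J/I}x_{J/I}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle| - 2|\langle P_J(\Phi)\Phi_{I/J}x_{I/J}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle|. \qquad (42)$$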

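We now estimate all the terms in the last sum, writing $n = |I\cap J|$ so that $|I/J| = |J/I| = S - n$. We have

$$|\langle\Phi_{I/J}x_{I/J}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle| \le \|\Phi_{(I\cup J)^c}^\star\Phi_{I/J}x_{I/J}\|_\infty\,\|x_{(I\cup J)^c}\|_1$$

$$= \max_{j\in(I\cup J)^c}|\langle\phi_j, \Phi_{I/J}x_{I/J}\rangle|\,\|x_{(I\cup J)^c}\|_1$$

$$\le \mu\sqrt{S-n}\,\|x_{I/J}\|_2\,\|x_{(I\cup J)^c}\|_1,$$

$$|\langle P_J(\Phi)\Phi_{I/J}x_{I/J}, \Phi_{(I\cup J)^c}x_{(I\cup J)^c}\rangle| \le \max_{i\in(I\cup J)^c}|\langle\phi_i, \Phi_J(\Phi_J^\star\Phi_J)^{-1}\Phi_J^\star\Phi_{I/J}x_{I/J}\rangle|\,\|x_{(I\cup J)^c}\|_1$$

$$\le \max_{i\in(I\cup J)^c}\|\Phi_J^\star\phi_i\|_2\,\|(\Phi_J^\star\Phi_J)^{-1}\Phi_J^\star\Phi_{I/J}\|_{2,2}\,\|x_{I/J}\|_2\,\|x_{(I\cup J)^c}\|_1$$

$$\le \mu\sqrt{S-n}\,\frac{\mu S}{1-(S-1)\mu}\,\|x_{I/J}\|_2\,\|x_{(I\cup J)^c}\|_1,$$

and

$$\|P_J(\Phi)\Phi_{I/J}x_{I/J}\|_2^2 \le \|\Phi_J(\Phi_J^\star\Phi_J)^{-1}\|_{2,2}^2\,\|\Phi_J^\star\Phi_{I/J}\|_{2,2}^2\,\|x_{I/J}\|_2^2 \le \frac{\mu^2S(S-n)}{1-(S-1)\mu}\,\|x_{I/J}\|_2^2.$$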
To estimate $\|[P_I(\Phi) - P_{I\cap J}(\Phi)]\Phi_{(I\cup J)^c}x_{(I\cup J)^c}\|_2$ we use the following relation between the orthogonal projections $P_A(\Phi)$, $P_B(\Phi)$ and $P_{A\cup B}(\Phi)$, for two disjoint index sets $A, B$. For simplicity we leave out the reference to the dictionary $\Phi$. We have,

$$P_{A\cup B} = \sum_{i=0}^{\infty}(P_AP_B)^iP_A(I - P_B) + \sum_{i=0}^{\infty}(P_BP_A)^iP_B(I - P_A)$$

$$= P_A + P_B + (P_A - I)\sum_{i=1}^{\infty}(P_BP_A)^i + (P_B - I)\sum_{i=1}^{\infty}(P_AP_B)^i.$$

Thus for a vector $y$ we have

$$\|(P_{A\cup B} - P_A)y\|_2 \le \|P_By\|_2 + \sum_{i=1}^{\infty}\|(P_BP_A)^iy\|_2 + \sum_{i=1}^{\infty}\|(P_AP_B)^iy\|_2.$$
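The series representation of $P_{A\cup B}$ can be verified numerically for generic subspaces. In the sketch below, which is an illustration and not part of the original argument, the two geometric tails $\sum_{i\ge 1}M^i$ are evaluated in closed form as $(I - M)^{-1} - I$; this agrees with the series when $\|P_AP_B\|_{2,2} < 1$, ie. when the two spans intersect trivially. The dimensions and the random atoms are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

def proj(M):
    """Orthogonal projection onto the column span of M (full column rank assumed)."""
    return M @ np.linalg.pinv(M)

def geometric_tail(M):
    """Closed form of sum_{i>=1} M^i, valid when the series converges."""
    n = M.shape[0]
    return np.linalg.inv(np.eye(n) - M) - np.eye(n)

# Atoms indexed by two disjoint sets A and B.
Phi_A = rng.standard_normal((d, 2))
Phi_B = rng.standard_normal((d, 3))
P_A, P_B = proj(Phi_A), proj(Phi_B)
P_AB = proj(np.hstack([Phi_A, Phi_B]))      # projection onto the joint span

I_d = np.eye(d)
contraction = float(np.linalg.norm(P_A @ P_B, 2))   # must be < 1 for convergence
P_series = (P_A + P_B
            + (P_A - I_d) @ geometric_tail(P_B @ P_A)
            + (P_B - I_d) @ geometric_tail(P_A @ P_B))
err_union = float(np.linalg.norm(P_AB - P_series))
```

The quantity `contraction` is the cosine of the smallest principal angle between the two spans; the closer it is to 1, the slower the series converges and the weaker the resulting bound on $\|(P_{A\cup B} - P_A)y\|_2$.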

Setting $A = I\cap J$, $B = I/J$ and $y = \Phi_{(I\cup J)^c}x_{(I\cup J)^c}$ we can estimate the terms in the expression above as,

fXy/S - n||x(/uj)c||i 



\\P B yh = 

\\{P B PA) l yh = 

< 
< 
< 

< 



i(^r^ii2<iiKrii2,2ii^yii 2 < 



t \*\ 



^l-(S-n-l)^ 



\(* B rh,2\\*B ((*im(*tr^) ,_1 (4rii2,2ii^ii 2 
i^ii^ii^^rii^ii^^niyii^ih 



^|| 2 ,2 (II^Alh^lK^A) -1 112,2)* (II^Bll^lK^^B)- 1 ^^)* 1 W&AVh 

\ i-1 



1 



n) 



n) 



< 



n-l)/x \(l-(n-l)/x)y ^(l-^-n-l)//) 
n ll :r (/uJ)<=lli n/i ^ fi 2 n(S-n) 



nVn\\ 



^l-(S-n-l) (1 - (n - l)/i) V(l " (n - %)(! - (5 - n - 1)//) 



X (/UJ) c Hl 
i-1 



IK^Wylb < Il4ll2, 2 (H^Blb^lK^B)- 1 IM 1 (H^aII^IK^a)- 1 !!^) 1 ||^y|| 2 

1 / [iy/n(S — n) \ 



< 



i-l 



n) 



< 



y/l-(n-l)n \(l-(S-n-l)fi)J ^(l-(n-lV) 
[iVS - n||x( 7u j)c ||i ^ fi 2 n(S-n) \ '--M 



fiVS - n||x(/uj)c||i 



V(l-(n-lV)(l-(5-n-lV) 
This leads to the following bound for the difference of the projections.

fiy/S- n||x (/u j)c ||i 



\\{Pmjb ~ PA)vh < 



/ npu , / pb 2 n(S—n) ^ V 2 

1 (l-(n-l)fi) l(l-(n-l)/Q(l-(g-n-l)p), 



fi 2 n(S—n) 



y x (l-(n-l)At)(l-(S-n-l)/i) 



^-"lk(Juj)°lli / 2(g-l)// (l-(n-l)/z)(l-(S-n-l)/j) 

- V + (l-(5-l)^) 1-(5-2) M -(5-1)m 2 



At\/5 - n||x (JuJ) c||i / 
Jl-(S-n-l)u V 



Vl-(5-n-l)A (1-(S-1>) 2 
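Substituting all the estimates into (42) we get,

$$\|P_I(\Phi)\Phi x\|_2^2 - \|P_J(\Phi)\Phi x\|_2^2$$

$$\ge (1 - (S-n-1)\mu)\|x_{I/J}\|_2^2 - (1 + (S-n-1)\mu)\|x_{J/I}\|_2^2$$

$$- \frac{\mu^2S(S-n)}{1-(S-1)\mu}\|x_{I/J}\|_2^2 - \frac{\mu^2(S-n)\|x_{(I\cup J)^c}\|_1^2}{1-(S-n-1)\mu}\left(1 + \frac{2(S-1)\mu}{(1-(S-1)\mu)^2}\right)^2$$

$$- 2\mu\sqrt{S-n}\,\|x_{I/J}\|_2\|x_{(I\cup J)^c}\|_1 - 2\mu\sqrt{S-n}\,\|x_{J/I}\|_2\|x_{(I\cup J)^c}\|_1$$

$$- 2\mu\sqrt{S-n}\,\frac{\mu S}{1-(S-1)\mu}\|x_{I/J}\|_2\|x_{(I\cup J)^c}\|_1 - 2\mu\sqrt{S-n}\,\frac{\mu S}{1-(S-1)\mu}\|x_{J/I}\|_2\|x_{(I\cup J)^c}\|_1$$

$$= a\|x_{I/J}\|_2^2 - 2b\|x_{I/J}\|_2 - c\|x_{J/I}\|_2^2 - 2b\|x_{J/I}\|_2 - d$$

$$= \left(\sqrt{a}\|x_{I/J}\|_2 - b/\sqrt{a}\right)^2 - \left(\sqrt{c}\|x_{J/I}\|_2 + b/\sqrt{c}\right)^2 - b^2/a + b^2/c - d,$$

where

$$a = 1 - (S-n-1)\mu - \frac{\mu^2S(S-n)}{1-(S-1)\mu},$$

$$b = \mu\sqrt{S-n}\,\|x_{(I\cup J)^c}\|_1\left(1 + \frac{S\mu}{1-(S-1)\mu}\right),$$

$$c = 1 + (S-n-1)\mu,$$

$$d = \frac{\mu^2(S-n)\|x_{(I\cup J)^c}\|_1^2}{1-(S-n-1)\mu}\left(1 + \frac{2(S-1)\mu}{(1-(S-1)\mu)^2}\right)^2.$$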

Thus to have $\|P_I(\Phi)\Phi x\|_2^2 - \|P_J(\Phi)\Phi x\|_2^2 > 0$ it is sufficient to have,

$$\left(\sqrt{a}\|x_{I/J}\|_2 - b/\sqrt{a}\right)^2 - \left(\sqrt{c}\|x_{J/I}\|_2 + b/\sqrt{c}\right)^2 - b^2/a + b^2/c - d > 0,$$

which is in turn implied by

$$\|x_{I/J}\|_2 > \|x_{J/I}\|_2\sqrt{c/a} + b/\sqrt{ca} + b/a + \sqrt{b^2/a^2 - b^2/(ca) + d/a} > 0,$$
Using the bounds $\|x_{I/J}\|_2 \ge \sqrt{S-n}\,c_S$ and $\|x_{J/I}\|_2 \le \sqrt{S-n}\,c_{S+1}$ we can further simplify to



cs > cs+i\/ c/a + bj \fca + b/a+ \Jb 2 /a 2 — b 2 / (ca) + d/a > 0, 
For S/j, < 1/2 we have the bounds, 

and Jb 2 /a 2 -b 2 /{ca)+d/a < 2 ^\ X( - IUJ ^ 1 , 
leading to the final condition, 

C5> r^ cs+1 + r^ V N " 

^ ^ i>S+l 